Abstract. We study policy iteration for infinite-horizon Markov decision
processes. It has recently been shown that policy iteration style algorithms
have exponential lower bounds in a two player game setting. We extend these
lower bounds to Markov decision processes with the total reward and
average-reward optimality criteria.



1 Introduction 

The problem of finding an optimal policy for infinite-horizon Markov decision 
process has been widely studied ^ . Policy iteration is one method that has been 
developed for this task [5] . This algorithm begins by choosing an arbitrary policy, 
and then iteratively improves that policy by modifying the policy so that it uses 
different actions. For each policy, the algorithm computes a set of actions that 
are switchable, and it then chooses some subset of these actions to be switched. 
The resulting policy is guaranteed to be an improvement. 

The choice of which subset of switchable actions to switch in each iteration is 
left up to the user: different variants of policy iteration can be created by giving 
different rules that pick the subset. Traditionally, policy iteration algorithms use 
a greedy rule that switches every state with a switchable action. Greedy policy 
iteration will be the focus of this paper. 

Policy iteration has been found to work well in practice, where it is used as 
an alternative to linear programming. Linear programming is known to solve 
the problem in polynomial time. However, relatively little is known about the 
complexity of policy iteration. Since each iteration yields a strictly improved 
policy, the algorithm can never consider the same policy twice. This leads to 
a natural exponential bound on the number of iterations before the algorithm 
arrives at the optimal policy. The best upper bounds have been provided by 
Mansour and Singh [B], who showed that greedy policy iteration will terminate 
in $O(k^n/n)$ iterations, where $k$ is the maximum number of outgoing actions from
a state. 

Melekopoglou and Condon have shown exponential lower bounds for some 
simple variants of policy iteration [7] . The policy iteration algorithms that they 
consider switch only a single action in each iteration. They give a family of 
examples upon which these policy iteration algorithms take $2^n - 1$ steps. It has
been a long standing open problem as to whether exponential lower bounds could 
be shown for greedy policy iteration. The best lower bound that has been shown 
so far is $n + 6$ iterations [5].



Policy iteration is closely related to the technique of strategy improvement for 
two player games. Friedmann [3] has recently found a parity game which forces 
the strategy improvement algorithm of Vöge and Jurdziński [!T to
take an exponential number of steps. It has been shown that this example can 
be used to show exponential lower bounds for strategy improvement algorithms 
for other prominent types of two player game [1].

Our contribution. Friedmann's example relies on the fact that there are two 
players in a parity game. We show how Friedmann's example can be adapted 
to provide exponential lower bounds for policy iteration on Markov decision 
processes. We present an example that provides an exponential lower bound for 
the total reward criterion, and we also argue that the same example provides an 
exponential lower bound for the average-reward criterion. 

2 Preliminaries 

A Markov decision process consists of a set of states $S$, where each state $s \in S$
has an associated set of actions $A_s$. For a given state $s \in S$ and action $a \in A_s$
the function $r(s, a)$ gives the reward for choosing the action $a$ in the state $s$.
Given two states $s$ and $s'$, and an action $a \in A_s$, the function $p(s'|s, a)$ gives the
probability of moving to state $s'$ when the action $a$ is chosen in state $s$. This is
a probability distribution, so $\sum_{s' \in S} p(s'|s, a) = 1$ for all $s$ and $a \in A_s$.

A policy $\pi : S \to A_s$ is a function that selects one action at each state. For
a given starting state $s_0$, a run that is consistent with a policy $\pi$ is an infinite
sequence of states $\langle s_0, s_1, \dots \rangle$ such that $p(s_{i+1}|s_i, \pi(s_i)) > 0$ for all $i$. The set
$\Omega^{\pi}_{s_0}$ contains every consistent run from $s_0$ when $\pi$ is used. A probability space can
be defined over these runs using the $\sigma$-algebra that is generated by the cylinder
sets of finite paths starting at $s_0$. The cylinder set of a finite path contains
every infinite path that has the finite path as a prefix. If we fix the probability
of the cylinder set of a finite path $\langle s_0, s_1, s_2, \dots, s_k \rangle$ to be
$\prod_{i=0}^{k-1} p(s_{i+1}|s_i, \pi(s_i))$, then standard techniques from probability theory
imply that there is a unique extension to a probability measure
$\mathbb{P}^{\pi}_{s_0}(\cdot)$ on the $\sigma$-algebra [3]. Given a function
that assigns a value to each consistent run $f : \Omega^{\pi}_{s_0} \to \mathbb{R}$, we define
$\mathbb{E}^{\pi}_{s_0}\{f\}$ to be the expectation of this function in the probability space.

The value of a state $s$ when a policy $\pi$ is used varies according to the choice
of optimality criterion. In the total reward criterion the value is
$\mathrm{Val}^{\pi}(s) = \mathbb{E}^{\pi}_{s}\{\sum_{i=0}^{\infty} r(s_i, s_{i+1})\}$,
and for the average-reward criterion the value is
$\mathrm{Val}^{\pi}_{A}(s) = \mathbb{E}^{\pi}_{s}\{\liminf_{N \to \infty} \frac{1}{N} \sum_{i=0}^{N} r(s_i, s_{i+1})\}$.
The computational objective is to find the
optimal policy $\pi^*$, which is the policy that maximizes the value function for
every starting state. We define the value of a state to be the value of that state
when an optimal policy is being used. That is, we define
$\mathrm{Val}(s) = \mathrm{Val}^{\pi^*}(s)$ and $\mathrm{Val}_A(s) = \mathrm{Val}^{\pi^*}_A(s)$ for every state $s$.

For each optimality criterion, it has been shown that the value of each state 
can be characterised by the solution of a system of optimality equations ^ . For 



the total reward criterion these optimality equations are, for every state $s$:

$$V(s) = \max_{a \in A_s} \Big( r(s, a) + \sum_{s' \in S} p(s'|s, a) \cdot V(s') \Big) \qquad (1)$$

For the average-reward criterion we have two types of optimality equation, which 
must be solved simultaneously. The first of these are called the gain equations: 

$$G(s) = \max_{a \in A_s} \Big( \sum_{s' \in S} p(s'|s, a) \cdot G(s') \Big) \qquad (2)$$

Secondly we have the bias equations. If
$M_s = \{a \in A_s : G(s) = \sum_{s' \in S} p(s'|s, a) \cdot G(s')\}$
is the set of actions that satisfy the gain equation at the state $s$, then the
bias equations are defined as:

$$B(s) = \max_{a \in M_s} \Big( r(s, a) - G(s) + \sum_{s' \in S} p(s'|s, a) \cdot B(s') \Big) \qquad (3)$$

We have that solutions to these equations characterise the value of every
state. That is, we have $\mathrm{Val}(s) = V(s)$ and $\mathrm{Val}_A(s) = G(s)$, for every state $s$.
We can also obtain an optimal policy by setting $\pi^*(s) = a$, where $a$ is an action
that achieves the maximum in the optimality equation.

3 Policy Iteration 

Policy iteration is a method for solving the optimality equations that we
presented in Section 2. We will begin by describing policy iteration for the total
reward criterion. For every policy $\pi$ that the algorithm considers, it will
compute the value $\mathrm{Val}^{\pi}(s)$ of the policy at every state $s$, and check whether this is a
solution of the optimality equation (1). The value of the policy can be obtained
by computing the solution to the following system, where $a = \pi(s)$:

$$\mathrm{Val}^{\pi}(s) = r(s, a) + \sum_{s' \in S} p(s'|s, a) \cdot \mathrm{Val}^{\pi}(s') \qquad (4)$$

If the value of $\pi$ satisfies the optimality equation (1) at every state, then a
solution has been found and the algorithm can terminate. Otherwise, we define
the appeal for each action $a \in A_s$ in the policy $\pi$ to be:
$\mathrm{Appeal}^{\pi}(s, a) = r(s, a) + \sum_{s' \in S} p(s'|s, a) \cdot \mathrm{Val}^{\pi}(s')$.
If the policy $\pi$ does not satisfy the optimality equation
then there must be at least one action $a$ at a state $s$ such that
$\mathrm{Appeal}^{\pi}(s, a) > \mathrm{Val}^{\pi}(s)$. We say that an action with this property is switchable in $\pi$. Switching
an action $a \in A_t$ in a policy $\pi$ creates a new policy $\pi'$ where $\pi'(s) = a$ if
$s = t$, and $\pi'(s) = \pi(s)$ for every other state $s$. The set of switchable actions
is important because it can be shown that switching any subset of switchable
actions will create an improved policy.

Theorem 1 ([8]). If $\pi$ is a policy and $\pi'$ is a policy that is obtained by switching
some subset of switchable actions in $\pi$ then $\mathrm{Val}^{\pi'}(s) \ge \mathrm{Val}^{\pi}(s)$ for every state $s$,
and there is some state in which the inequality is strict.



Policy iteration begins by choosing an arbitrary policy for the MDP. In every 
iteration it computes the set of switchable actions, and then picks some subset 
of these actions to switch in the current policy. This creates a new policy which 
will be considered in the subsequent iteration. Since policy iteration only ever 
switches switchable actions, Theorem 1 implies that it cannot visit the same
policy twice. This is because repeating a policy would require the value of some 
state to decrease. Since there are a finite number of policies, the algorithm must 
eventually arrive at a policy with no switchable actions. This policy clearly 
satisfies the optimality equation (1), and policy iteration can terminate.

Note that any subset of switchable actions can be chosen in each iteration of
the algorithm, and the choice of subset affects the behaviour of the algorithm. In
this paper we study the greedy policy iteration algorithm, which selects the most
appealing switchable action at every state. For every state $s$ where equation (1)
is not satisfied, the algorithm will switch the action: $\arg\max_{a \in A_s} (\mathrm{Appeal}^{\pi}(s, a))$.
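The greedy rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used for the results in this paper: the dictionary encoding of the MDP, the function names, and the three-state toy MDP are all assumptions made for the example. It also assumes that every policy reaches the zero-reward sink with probability 1, so that the linear system in equation (4) has a unique solution.

```python
from fractions import Fraction


def evaluate(mdp, policy, sink):
    """Solve equation (4) exactly, with the value of the sink fixed at 0."""
    states = [s for s in mdp if s != sink]
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    rows = []
    for s in states:
        reward, dist = mdp[s][policy[s]]
        row = [Fraction(0)] * (n + 1)  # one row of (I - P) V = r
        row[index[s]] += 1
        for t, prob in dist.items():
            if t != sink:
                row[index[t]] -= Fraction(prob)
        row[n] = Fraction(reward)
        rows.append(row)
    # Gauss-Jordan elimination over exact rationals.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = 1 / rows[col][col]
        rows[col] = [x * inv for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    values = {s: rows[index[s]][n] for s in states}
    values[sink] = Fraction(0)
    return values


def appeal(mdp, values, s, a):
    """Appeal(s, a) = r(s, a) + sum over s' of p(s'|s, a) * Val(s')."""
    reward, dist = mdp[s][a]
    return reward + sum(Fraction(p) * values[t] for t, p in dist.items())


def greedy_policy_iteration(mdp, policy, sink):
    """Switch the most appealing switchable action at every state, repeatedly."""
    policy = dict(policy)
    iterations = 0
    while True:
        values = evaluate(mdp, policy, sink)
        new_policy = dict(policy)
        for s in mdp:
            best = max(mdp[s], key=lambda a: appeal(mdp, values, s, a))
            if appeal(mdp, values, s, best) > values[s]:  # switchable
                new_policy[s] = best
        if new_policy == policy:
            return policy, values, iterations
        policy = new_policy
        iterations += 1


# Hypothetical toy MDP: each action is (reward, {successor: probability}).
half = Fraction(1, 2)
toy_mdp = {
    "s": {"slow": (1, {"t": 1}), "quit": (0, {"end": 1})},
    "t": {"low": (1, {"end": 1}), "high": (3, {"end": 1}),
          "loop": (1, {"t": half, "end": half})},
    "end": {"stay": (0, {"end": 1})},
}
start = {"s": "quit", "t": "low", "end": "stay"}
final_policy, final_values, steps = greedy_policy_iteration(toy_mdp, start, "end")
```

On this toy instance both `s` and `t` have a switchable action in the initial policy, and the greedy rule switches them in the same iteration.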

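Policy iteration for the average-reward criterion follows the same pattern, but it
uses optimality equations (2) and (3) to decide which actions are switchable
in a given policy. For each policy $\pi$ it computes a solution to:

$$G^{\pi}(s) = \sum_{s' \in S} p(s'|s, \pi(s)) \cdot G^{\pi}(s')$$

$$B^{\pi}(s) = r(s, \pi(s)) - G^{\pi}(s) + \sum_{s' \in S} p(s'|s, \pi(s)) \cdot B^{\pi}(s')$$

An action $a \in A_s$ is switchable if either $\sum_{s' \in S} p(s'|s, a) \cdot G^{\pi}(s') > G^{\pi}(s)$ or if
$\sum_{s' \in S} p(s'|s, a) \cdot G^{\pi}(s') = G^{\pi}(s)$ and:
$r(s, a) - G^{\pi}(s) + \sum_{s' \in S} p(s'|s, a) \cdot B^{\pi}(s') > B^{\pi}(s)$.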

4 Exponential Lower Bounds For The Total Reward 
Criterion 

In this section we will describe a family of examples that force policy iteration 
for the total reward criterion to take an exponential number of steps. Due to the 
size and complexity of the example, we will break the example down into several 
component parts, which will be presented separately. 

The example will actually contain very few actions that are probabilistic. An
action $a \in A_s$ is deterministic if there is some state $s'$ such that $p(s'|s, a) = 1$.
For the sake of convenience, we will denote actions of this form as $(s, s')$. We also
overload our previous notations: the notation $\pi(s) = s'$ indicates that $\pi$ chooses
the deterministic action from $s$ to $s'$, the function $r(s, s')$ gives the reward of this
action, and $\mathrm{Appeal}^{\pi}(s, s')$ gives the appeal of this action under the policy $\pi$.

Since we are working with the total reward criterion, care must be taken
to ensure that the value of a policy remains well defined. For this purpose, the
example will contain a sink state $c_{n+1}$ that has a single action $(c_{n+1}, c_{n+1})$ with
reward 0. This will be an absorbing state, in the sense that every run of the
MDP from every starting state will eventually arrive at the state $c_{n+1}$, for every
policy that is considered by policy iteration. This will ensure that the value of
each state remains finite throughout the execution of the algorithm.



We will give several diagrams for parts of the example, such as the diagram
given in Figure 1. States are represented by boxes, and the name of a state is
printed on the box. Actions are represented by arrows: deterministic actions are 
represented as an arrow from one state to another, and probabilistic actions are 
represented as arrows that split, and end at multiple states. The probability 
distribution is marked after the arrow has split, and the reward of the action is 
marked before the arrow has split. 

Our overall goal is to construct an example that forces policy iteration to 
mimic the behaviour of a binary counter. Each policy will be associated with 
some configuration of a binary counter, and the exponential lower bound will 
be established by forcing policy iteration to pass through at least one policy for 
every possible configuration of the binary counter. If the bits of this counter are
indexed 1 through $n$, then there are two conditions that are sufficient to enforce
this behaviour. Firstly, a bit with index $i$ should become 1 only after all bits
with index $j < i$ are 1. Secondly, when the bit with index $i$ becomes 1, every bit
with index $j < i$ must be set to 0. Our exposition will follow this structure:
in Section 4.1 we will describe how each policy corresponds to a configuration of
a binary counter, in Section 4.2 we will show how the first condition is enforced,
and in Section 4.3 we will show how the second condition is enforced.
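The two conditions describe an ordinary binary increment. The following Python sketch, which is an illustration rather than part of the construction, represents a configuration as the set of indices of the bits that are 1; the function names are hypothetical. It shows that a counter obeying the two conditions visits all $2^n$ configurations.

```python
def increment(config, n):
    """Successor configuration, or None if every bit is already 1.

    The smallest bit i that is 0 becomes 1 (so all bits j < i are 1),
    and every bit j < i is reset to 0.
    """
    unset = [i for i in range(1, n + 1) if i not in config]
    if not unset:
        return None
    i = min(unset)
    return frozenset(b for b in config if b > i) | {i}


def count_configurations(n):
    """List every configuration visited when counting up from the empty set."""
    config = frozenset()
    visited = [config]
    while True:
        config = increment(config, n)
        if config is None:
            return visited
        visited.append(config)


visited = count_configurations(3)
```

For $n = 3$ the visited configurations are, in order, $\emptyset$, $\{1\}$, $\{2\}$, $\{1,2\}$, $\{3\}$, $\{1,3\}$, $\{2,3\}$, and $\{1,2,3\}$.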



4.1 A Bit 



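[Figure: the state $b_i$ and its probabilistic action $a_i$, whose two branches carry the probabilities $1 - \frac{1}{(10n+4)2^n}$ and $\frac{1}{(10n+4)2^n}$.]

Fig. 1. The structure for the bit with index i.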



The example will contain $n$ instances of the structure shown in Figure 1, which
will represent the bits of a binary counter. We will represent the configuration
of a binary counter as a set $B \subseteq \{1, 2, \dots, n\}$ that contains the indices of the bits
that are 1. A policy $\pi$ represents a configuration $B$ if $\pi(b_i) = a_i$ for every index
$i \in B$, and $\pi(b_i) \ne a_i$ for every index $i \notin B$. For a set of natural numbers
$B$ we define $B^{>i}$ to be the set $B \setminus \{k \in \mathbb{N} : k \le i\}$. We define analogous
notations for the relations $<$, $\ge$, and $\le$.



The actions $a_i$ will be the only probabilistic actions in the example. It is
worth noting that when the action $a_i$ is chosen at $b_i$ the effect, under the total
reward criterion, is identical to a deterministic action $(b_i, g_i)$ with reward 0. The
fact that it takes an expected $(10n + 4)2^n$ steps to move from $b_i$ to $g_i$ using
the action $a_i$ is irrelevant because the reward of $a_i$ is 0, and these steps will have
no effect on the total reward.

Proposition 2. For every policy $\pi$, if $\pi(b_i) = a_i$ then $\mathrm{Val}^{\pi}(b_i) = \mathrm{Val}^{\pi}(g_i)$.

The reason why the given probabilities have been chosen for the action $a_i$ is
that the value of the state $g_i$ will never exceed $(10n + 4)2^n$.

Assumption 3. For every policy $\pi$ we have $\mathrm{Val}^{\pi}(b_i) > 0$ and
$\mathrm{Val}^{\pi}(g_i) \le (10n + 4)2^n$.

Although the action $a_i$ behaves like a deterministic action when it is chosen
at $b_i$, it behaves differently when it is not chosen. A deterministic action $(b_i, g_i)$
would have $\mathrm{Appeal}^{\pi}(b_i, g_i) = \mathrm{Val}^{\pi}(g_i)$ in every policy. By contrast, when $a_i$ is
not chosen by a policy $\pi$, we can show that the appeal of $a_i$ is at most $\mathrm{Val}^{\pi}(b_i) + 1$.

Proposition 4. Suppose that Assumption 3 holds. If $\pi$ is a policy such that
$\pi(b_i) \ne a_i$ then $\mathrm{Appeal}^{\pi}(b_i, a_i) < \mathrm{Val}^{\pi}(b_i) + 1$.
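Propositions 2 and 4 can be checked numerically for small instances. The Python sketch below is an illustration under the assumption, taken from the proofs in Appendix A, that the action $a_i$ has reward 0, moves to $g_i$ with probability $\frac{1}{(10n+4)2^n}$, and otherwise stays at $b_i$. The function names and the sample values are hypothetical.

```python
from fractions import Fraction


def bit_action_probability(n):
    """Probability that the action a_i moves from b_i to g_i."""
    return Fraction(1, (10 * n + 4) * 2 ** n)


def value_when_chosen(n, val_g):
    """Solve Val(b) = 0 + (1 - p) * Val(b) + p * Val(g) for Val(b)."""
    p = bit_action_probability(n)
    return (p * val_g) / (1 - (1 - p))


def appeal_when_not_chosen(n, val_b, val_g):
    """Appeal of a_i at b_i when the policy uses a different action at b_i."""
    p = bit_action_probability(n)
    return (1 - p) * val_b + p * val_g


n = 3
bound = (10 * n + 4) * 2 ** n          # largest value allowed for g_i
chosen = value_when_chosen(n, Fraction(100))
not_chosen = appeal_when_not_chosen(n, Fraction(5), Fraction(bound))
```

With $n = 3$ the bound is $272$. When $a_i$ is chosen the value of $b_i$ equals the value of $g_i$, and when it is not chosen the appeal stays below $\mathrm{Val}^{\pi}(b_i) + 1$ even if $g_i$ attains the bound.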

This is the key property that will allow us to implement a binary counter.
The value of the state $g_i$ could be much larger than the value of $b_i$. However,
we are able to prevent policy iteration from switching the action $a_i$ by ensuring
that there is always some other action $x$ such that $\mathrm{Appeal}^{\pi}(b_i, x) \ge \mathrm{Val}^{\pi}(b_i) + 1$.

4.2 Switching the Smallest Bit 

In this section we will give a full description of the example, and we will show
how policy iteration can only switch the state $b_i$ to $a_i$ after every state $b_j$ with
$j < i$ has been switched to $a_j$.

Figure 2 shows one of the key structures in the example, which is called the
deceleration lane. In the previous section we argued that an action $(b_i, x)$ with
$\mathrm{Appeal}^{\pi}(b_i, x) \ge \mathrm{Val}^{\pi}(b_i) + 1$ is required in every policy $\pi$ with $\pi(b_i) \ne a_i$ to
prevent policy iteration from switching the action $a_i$. The deceleration lane is the
structure that ensures these actions will exist.

The states x and y both have outgoing actions that will be specified later. For 
now, we can reason about the behaviour of the deceleration lane by assuming 
that the value of y is larger than the value of x. 

Assumption 5. For every policy $\pi$ we have $\mathrm{Val}^{\pi}(y) > \mathrm{Val}^{\pi}(x)$.

The initial policy for the deceleration lane is the one in which every state $d_k$
chooses the action $(d_k, y)$. It is not difficult to see that the only switchable action
in this policy is $(d_1, d_0)$. This is a general trend: the action $(d_j, d_{j-1})$ can only
be switched after every action $(d_k, d_{k-1})$ with $1 \le k < j$ has been switched.




Fig. 2. The deceleration lane. 



Therefore, policy iteration will take $2n$ steps to arrive at the optimal policy for
the deceleration lane. Formally, for every $j$ in the range $0 \le j \le 2n$ we define:

$$\pi_j(s) = \begin{cases} d_{k-1} & \text{if } s = d_k \text{ and } 1 \le k \le j, \\ y & \text{otherwise.} \end{cases} \qquad (5)$$



Proposition 6. Suppose that Assumption 5 holds. Applying policy iteration to
$\pi_0$ produces the sequence of policies $\langle \pi_0, \pi_1, \dots, \pi_{2n} \rangle$.
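The behaviour claimed in Proposition 6 can be reproduced with a small simulation. The Python sketch below is an illustration only. It assumes the rewards used in the proofs of Appendix B: $-1$ for each action $(d_k, d_{k-1})$, $4n + 1$ for the action from $d_0$ to $y$, and $0$ for the actions $(d_k, y)$ with $k \ge 1$. It also assumes that the value of $y$ is a fixed constant and, following Assumption 5, it omits the actions to $x$, which are never more appealing than the actions to $y$. The function names are hypothetical.

```python
def lane_values(policy, n, val_y=0):
    """Values of d_0, ..., d_2n; policy[k] is 'down' (to d_{k-1}) or 'y'."""
    values = [4 * n + 1 + val_y]  # d_0 always moves to y with reward 4n + 1
    for k in range(1, 2 * n + 1):
        if policy[k] == "down":
            values.append(-1 + values[k - 1])
        else:
            values.append(val_y)
    return values


def greedy_step(policy, n, val_y=0):
    """One iteration of greedy policy iteration restricted to the lane."""
    values = lane_values(policy, n, val_y)
    new_policy = list(policy)
    for k in range(1, 2 * n + 1):
        appeals = {"down": -1 + values[k - 1], "y": val_y}
        best = max(appeals, key=appeals.get)
        if appeals[best] > values[k]:  # switch only switchable actions
            new_policy[k] = best
    return new_policy


def run_lane(n):
    """Policies visited when starting with every d_k choosing (d_k, y)."""
    policy = ["y"] * (2 * n + 1)
    history = [policy]
    while True:
        successor = greedy_step(policy, n)
        if successor == policy:
            return history
        history.append(successor)
        policy = successor


history = run_lane(3)
```

In this sketch exactly one action $(d_{j+1}, d_j)$ is switched in each iteration, so the simulation visits $2n + 1$ policies that match $\pi_0, \dots, \pi_{2n}$ on the lane.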




Fig. 3. The outgoing actions from the state $b_i$.



Figure 3 shows how each state $b_i$ is connected to the deceleration lane. Of
course, since we have not yet specified the outgoing actions from the states $f_i$,
we cannot reason about their appeal. These actions will be used later to force the
state $b_i$ to switch away from the action $a_i$ as the binary counter moves between
configurations. For now, we can assume that these actions are not switchable.

Assumption 7. We have $\mathrm{Appeal}^{\pi}(b_i, f_j) < \mathrm{Val}^{\pi}(b_i)$ for every policy $\pi$ and
every action $(b_i, f_j)$.



We now describe the behaviour of policy iteration for every index $i \notin B$. The
initial policy for the state $b_i$ will choose the action $(b_i, y)$. In the first iteration the
action $(b_i, d_{2i})$ will be switched, but after this the action chosen at $b_i$ follows the
deceleration lane: policy iteration will switch the action $(b_i, d_k)$ in the iteration
immediately after it switches the action $(d_k, d_{k-1})$. Since
$r(b_i, d_k) + r(d_k, d_{k-1}) = r(b_i, d_{k-1}) + 1$, this satisfies the condition that prevents the action $a_i$ being
switched at $b_i$. Formally, for every $j$ in the range $0 \le j \le 2i + 1$ we define:

$$\pi^o_j(s) = \begin{cases} \pi_j(s) & \text{if } s = d_k \text{ for some } k, \\ y & \text{if } j = 0 \text{ and } s = b_i, \\ d_{2i} & \text{if } j = 1 \text{ and } s = b_i, \\ d_{j-1} & \text{if } 2 \le j \le 2i + 1 \text{ and } s = b_i. \end{cases}$$

Proposition 8. Suppose that Assumptions 3, 5, and 7 hold. When policy
iteration is applied to $\pi^o_0$ it will produce $\langle \pi^o_0, \pi^o_1, \dots, \pi^o_{2i+1} \rangle$.

We can now see why a bit with index $i$ can only be set to 1 after all bits with
index $j$ such that $j < i$ have been set to 1. Since each state $b_i$ has $2i$ outgoing
actions to the deceleration lane, policy iteration is prevented from switching the
action $a_i$ for $2i + 2$ iterations. Therefore, policy iteration can switch $a_i$ at the
state $b_i$ at least two iterations before it can switch $a_j$ at a state $b_j$ with $j > i$.

The second important property of the deceleration lane is that it can be
reset. If at any point policy iteration arrives at a policy $\pi^o_j$ in which
$\mathrm{Val}^{\pi^o_j}(x) > \mathrm{Val}^{\pi^o_j}(y) + 6n + 1$ then policy iteration will switch the actions $(d_k, x)$ for all $k$
and the action $(b_i, x)$ for every $i \notin B$, to create a policy $\pi'$. The reason why
these actions must be switched is that the largest value that a state $d_k$ or $b_i$
can obtain in a policy $\pi^o_j$ is $\mathrm{Val}^{\pi^o_j}(y) + 6n + 1$. Now suppose that
$\mathrm{Val}^{\pi'}(y) > \mathrm{Val}^{\pi'}(x) + 4n$. If this is true, then policy iteration will switch the actions $(d_k, y)$
and the action $(b_i, y)$, and it therefore arrives at the policy $\pi^o_0$. The ability to
force the deceleration lane to reset by manipulating the difference between the
values of $y$ and $x$ will be used in the next section.

We now turn our attention to the states $b_i$ where $i \in B$. Policy iteration
should never switch away from the action $a_i$ at these states irrespective of the
state of the deceleration lane. Since we have not yet specified the outgoing actions
of $g_i$, we need to assume that the value of $b_i$ is large enough to prevent the actions
$(b_i, d_k)$ being switchable.

Assumption 9. For every policy $\pi$, if $i \in B$ then $\mathrm{Val}^{\pi}(b_i) > \mathrm{Val}^{\pi}(y) + 6n + 1$.

When this assumption holds, the state $b_i$ will not be switched away from the
action $a_i$. Formally, for $j$ in the range $0 \le j \le 2i$ we define:

^/ N I T^As) if s = dj for some j, 

Proposition 10. Suppose that As sumptions and\^hold. When policy it- 

eration is applied to ttq it will produce the sequence (7rQ,7rJ, . . -TTgj). 
Fig. 4. The structure associated with the state $b_i$.



Figure 4 shows the structure that is associated with each state $b_i$. We
complete the example by specifying the outgoing actions from $x$ and $y$: there is an
action $(y, c_i)$ with reward 0 for every $i$ in the range $1 \le i \le n + 1$, there is an
action $(x, f_i)$ with reward 0 for every $i$ in the range $1 \le i \le n$, and there is an
action $(x, c_{n+1})$ with reward $-1$.

The idea is that the state $c_i$ should use the action $(c_i, f_i)$ only when the index
$i$ is a member of $B$. Moreover, the state $r_i$ should use the action $(r_i, c_j)$ where
$j \in B$ is the smallest bit that is both larger than $i$ and a member of $B$. The
state $x$ should use the action $(x, f_j)$ and the state $y$ should use the action $(y, c_j)$
where $j$ is the smallest index that is a member of $B$.

Formally, for each configuration $B$ we define a policy $\pi^B$ for these states.
We define $\pi^B(c_i) = (c_i, f_i)$ if $i \in B$ and $\pi^B(c_i) = (c_i, r_i)$ if $i \notin B$. We define
$\pi^B(r_i) = (r_i, c_j)$ where $j = \min(B^{>i} \cup \{n + 1\})$. We define $\pi^B(y) = (y, c_j)$ where
$j = \min(B \cup \{n + 1\})$. We define $\pi^B(x) = (x, f_j)$ where $j = \min(B)$ if $B \ne \emptyset$,
and we define $\pi^B(x) = (x, c_{n+1})$ if $B = \emptyset$.
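The definition of $\pi^B$ translates directly into code. The following Python sketch is an illustration in which states are encoded as hypothetical tuples such as `("c", i)`, and each state is mapped to the target of its chosen deterministic action.

```python
def counter_policy(config, n):
    """Partial policy pi^B on the states c_i, r_i, x and y.

    `config` is the set B of indices of the bits that are 1.
    """
    sink = ("c", n + 1)
    policy = {}
    for i in range(1, n + 1):
        # c_i moves to f_i exactly when bit i is set, otherwise to r_i.
        policy[("c", i)] = ("f", i) if i in config else ("r", i)
        # r_i moves to c_j for the smallest set bit j > i, or to the sink.
        larger = [j for j in config if j > i]
        policy[("r", i)] = ("c", min(larger + [n + 1]))
    # y moves to c_j for the smallest set bit j, or to the sink.
    policy["y"] = ("c", min(list(config) + [n + 1]))
    # x moves to f_j for the smallest set bit j, or to the sink if B is empty.
    policy["x"] = ("f", min(config)) if config else sink
    return policy


example = counter_policy({2, 4}, 4)
empty = counter_policy(set(), 4)
```

For $n = 4$ and $B = \{2, 4\}$, the states $c_2$ and $c_4$ enter their bit structures, $r_1$ skips ahead to $c_2$, and both $x$ and $y$ point at the smallest set bit.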

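We can now define the sequence of policies that policy iteration will pass
through for each configuration $B$. This definition combines the partial policies
$\pi_j$, $\pi^o_j$, $\pi^c_j$, and $\pi^B$ to give a complete policy $\pi^B_j$. If $i = \min(\{l \notin B : 1 \le l \le n\})$
then we define $\mathrm{Sequence}(B) = \langle \pi^B_0, \pi^B_1, \dots, \pi^B_{2i+1} \rangle$, where:

$$\pi^B_j(s) = \begin{cases} \pi_j(s) & \text{if } s = d_k \text{ for some } k, \\ \pi^c_j(s) & \text{if } s = b_l \text{ where } l \in B, \\ \pi^o_j(s) & \text{if } s = b_l \text{ where } l \notin B, \\ \pi^B(s) & \text{otherwise.} \end{cases}$$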

We can now see why the assumptions that we have made are true in the full
example. For example, in Assumption 3 we asserted that $\mathrm{Val}^{\pi}(g_i) \le (10n + 4)2^n$.
This holds for every policy $\pi^B_j$ because by following this policy from the state $g_i$
we pass through $r_i$ followed by $c_j$, $f_j$, $b_j$, $g_j$, and $r_j$ for every index $j \in B^{>i}$ before
arriving at the sink state $c_{n+1}$. Therefore, the value of the state $g_i$ under the
policy $\pi^B_j$ can be at most
$\sum_{j=i+1}^{n} (10n + 4)(2^j - 2^{j-1}) + (10n + 4)2^i = (10n + 4)2^n$.
The other assumptions that we have made can also be shown to be true for
every policy $\pi^B_j$.

Proposition 11. For every configuration $B$ we have that Assumptions 3, 5, 7,
and 9 hold for every policy $\pi$ in $\mathrm{Sequence}(B)$.

Our previous propositions have done most of the work in showing that if
policy iteration is applied to $\pi^B_0$, then it will pass through $\mathrm{Sequence}(B)$. To complete
the proof it is sufficient to note that policy iteration never switches away from
the policy $\pi^B$ at the states $c_i$, $r_i$, $x$, and $y$.

Proposition 12. When policy iteration is applied to $\pi^B_0$ policy iteration will
pass through the sequence of policies given by $\mathrm{Sequence}(B)$.



4.3 Moving Between Configurations 

In this section we will describe the behaviour of policy iteration after the
final policy in $\mathrm{Sequence}(B)$ has been reached. Throughout this section we define
$i = \min(\{j \notin B : 1 \le j \le n\})$ to be the smallest index that is not in the
configuration $B$, and we define $B' = (B \cup \{i\}) \setminus \{1, 2, \dots, i - 1\}$. Our goal is to show
that policy iteration moves from the policy $\pi^B_{2i+1}$ to the policy $\pi^{B'}_0$.

The first policy that policy iteration will move to is identical to the policy 
7r^_(_2, with the exception that the state bi is switched to the action a^. We define: 

j-^-j _ if s = 6j, 

[7i'^+2(s) otherwise. 

This occurs because the state $b_i$ only has $2i$ actions of the form $(b_i, d_k)$.
Therefore, once the policy $\pi^B_{2i+1}$ is reached there will be no action of the form
$(b_i, d_k)$ to distract policy iteration from switching the action $a_i$. Since every other
state $b_j$ with $j \notin B$ has at least two actions $(b_j, d_k)$ with $k > 2i$, they move to
the policy $\pi^B_{2i+2}$.

Proposition 13. Policy iteration moves from the policy tt^^j^ to the policy tt^^. 

Since the action Oi has been switched the value of the state fi is raised to 
Var^i(r,) + (lOn + 4)(2* - 2*-^) - An. The reward of (lOn + 4)2' is sufficiently 
large to cause policy iteration to switch the actions {ci, fi) and (x,fi). It will 
also switch the actions {bj,fi) where for every index j < i. Since every index 
j ^ B other than i has at least one action {bj, Dk), these states can be switched 
to the policy 7r^+3(s)- Therefore, we define: 




) if s = bi or s = or s G {cj, hj, Vj : j > i} 

a s = X OT s — Ci 01 s = hj with j < i, 
(s) otherwise. 



The most important thing in this iteration is that every state $b_j$ with index
$j < i$ is switched away from the action $a_j$. This provides the critical property of
resetting every bit that has a smaller index than $i$. Another important property is
that, while the action $(x, f_i)$ can be switched in this iteration, the action $(y, c_i)$
cannot be switched until after the action $(c_i, f_i)$ has been switched. This will
provide a single iteration in which the value of $x$ will exceed the value of $y$,
which is the first of the two conditions necessary to reset the deceleration lane.

Proposition 14. Policy iteration moves from the policy tt^j to the policy Trjjg. 

In the next iteration the deceleration lane begins to reset as policy iteration 
switches {dk, x) for all k and {bj,x) where j > i and j ^ B. Policy iteration also 
switches {y,Ci) and {rj,Ci) with j < i. We define: 



The switching of {y,Ci) provides the second condition for the reset of the 

B 

deceleration lane. After the action is switched the value of y will be 'Vay"^{fi) + 

4n + 1 whereas the value of x will be Val'^^^ (/;). Therefore, policy iteration will 
reset the deceleration lane in the next iteration. It is also important that the 
action {bj , x) for j ^ B is switchable in this iteration, since if i + 1 ^ B then bi+i 
will have run out of actions (6i+i, d^) to distract it from switching a^+i. The is 
the reason why each state bi must have 2i actions to the deceleration lane. 

Proposition 15. Policy iteration moves from the policy tt^2 policy 7r|j^. 

Finally, once policy iteration has reached the policy Tr^g it will move to 
the policy ttq . This involves completing the reset of the deceleration lane by 
switching (dk^y) for all fc, and switching the actions {bj,y) for every state bj 
with index j ^ B' . It also makes the final step in transforming the policy tt^ to 
the policy tt^ by switching the actions (cj, rj) at every state Cj with j < i. 

Proposition 16. Policy iteration moves from the policy tt^j to the policy tTq . 

When combined with Proposition 12, the propositions in this section imply
that policy iteration will move from the policy $\pi^B_0$ to the policy $\pi^{B'}_0$. The optimal
policy for the example is $\pi^B_{2n+1}$ where $B = \{1, 2, \dots, n\}$. This is the policy that




if s e {cj,bj,rj : j > i} U {x}, 

s = y OT s = rj with j < i, 

s = dk for some fc or s = bj with j ^ B \ {x} 



.B 

2i+3 



(s) if s = Cj with j < i. 



selects $a_i$ at $b_i$ for all $i$, and in which the deceleration lane has reached its
optimal policy. Our results so far indicate that if we begin policy iteration at
the policy $\pi^{\emptyset}_0$, then policy iteration must pass through a policy $\pi^B_0$ for every
$B \subseteq \{1, 2, \dots, n\}$. Therefore, it will take at least $2^n$ iterations to terminate.

Theorem 17. When policy iteration for the total reward criterion is applied to
the policy $\pi^{\emptyset}_0$ it will take at least $2^n$ iterations to find the optimal policy.

Finally, we can also argue that the example provides an exponential
lower bound for policy iteration for the average-reward criterion. The first thing
to note is that $G^{\pi}(s) = 0$ for every policy that we have specified. This is because
all runs eventually reach the sink state $c_{n+1}$. Since the reward of the action
$(c_{n+1}, c_{n+1})$ is 0, the long term average-reward of every policy will be 0. Note
that when $G^{\pi}(s) = 0$, the bias optimality equation (3) becomes identical to
the total reward optimality equation (1). This causes policy iteration for the
average-reward criterion to behave identically to policy iteration for the total
reward criterion on this example.

Theorem 18. When policy iteration for the average-reward criterion is applied
to the policy $\pi^{\emptyset}_0$ it will take at least $2^n$ iterations to find the optimal policy.
A Proofs for Section 4.1

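Proof (of Proposition 2). Using the optimality equation given in (4) we get:

$$\mathrm{Val}^{\pi}(b_i) = r(b_i, a_i) + \sum_{s \in S} p(s|b_i, a_i) \mathrm{Val}^{\pi}(s)$$

$$\mathrm{Val}^{\pi}(b_i) = 0 + \Big(1 - \frac{1}{(10n + 4)2^n}\Big) \mathrm{Val}^{\pi}(b_i) + \frac{\mathrm{Val}^{\pi}(g_i)}{(10n + 4)2^n}$$

$$\frac{\mathrm{Val}^{\pi}(b_i)}{(10n + 4)2^n} = \frac{\mathrm{Val}^{\pi}(g_i)}{(10n + 4)2^n}$$

$$\mathrm{Val}^{\pi}(b_i) = \mathrm{Val}^{\pi}(g_i)$$

□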

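Proof (of Proposition 4). We have:

$$\mathrm{Appeal}^{\pi}(b_i, a_i) = r(b_i, a_i) + \sum_{s \in S} p(s|b_i, a_i) \mathrm{Val}^{\pi}(s)$$

$$= 0 + \Big(1 - \frac{2^{-n}}{10n + 4}\Big) \mathrm{Val}^{\pi}(b_i) + \frac{2^{-n}}{10n + 4} \mathrm{Val}^{\pi}(g_i)$$

$$< \mathrm{Val}^{\pi}(b_i) + \frac{2^{-n}}{10n + 4} \cdot (10n + 4)2^n$$

$$= \mathrm{Val}^{\pi}(b_i) + 1$$

□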

B Proof of Proposition [6] 

Proposition 19. When $j \le i + 1$ we have
$\mathrm{Appeal}^{\pi_i}(d_j, d_{j-1}) = \mathrm{Val}^{\pi_i}(y) + 4n - j + 1$.

Proof. By the definition of $\pi_i$ we have that $\pi_i(d_j) = d_{j-1}$ for all states $d_j$ with
$j \le i$, and we have that $\pi_i(d_0) = y$. Using the definition of appeal, and applying
the optimality equation (4) repeatedly gives, for every action $(d_j, d_{j-1})$ with
$j \le i + 1$:

$$\mathrm{Appeal}^{\pi_i}(d_j, d_{j-1}) = r(d_j, d_{j-1}) + \mathrm{Val}^{\pi_i}(d_{j-1})$$

$$= \sum_{k=1}^{j} r(d_k, d_{k-1}) + r(d_0, y) + \mathrm{Val}^{\pi_i}(y)$$

$$= -j + (4n + 1) + \mathrm{Val}^{\pi_i}(y)$$

□



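Proposition 20. When $i + 1 < j \le 2n$ we have
$\mathrm{Appeal}^{\pi_i}(d_j, d_{j-1}) = \mathrm{Val}^{\pi_i}(y) - 1$.

Proof. By definition we have that $\pi_i(d_{j-1}) = y$. Using the definition of appeal
and the optimality equation gives:

$$\mathrm{Appeal}^{\pi_i}(d_j, d_{j-1}) = r(d_j, d_{j-1}) + \mathrm{Val}^{\pi_i}(d_{j-1})$$

$$= r(d_j, d_{j-1}) + r(d_{j-1}, y) + \mathrm{Val}^{\pi_i}(y)$$

$$= \mathrm{Val}^{\pi_i}(y) - 1$$

□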

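Proposition 21. If $\mathrm{Val}^{\pi_i}(x) < \mathrm{Val}^{\pi_i}(y)$ then $\mathrm{Appeal}^{\pi_i}(d_j, x) < \mathrm{Appeal}^{\pi_i}(d_j, y)$
for all $j$. If $\mathrm{Val}^{\pi_i}(x) > \mathrm{Val}^{\pi_i}(y)$ then $\mathrm{Appeal}^{\pi_i}(d_j, x) > \mathrm{Appeal}^{\pi_i}(d_j, y)$ for all
$j$.

Proof. Using the definition of appeal for the state $d_j$ gives two equalities:

$$\mathrm{Appeal}^{\pi_i}(d_j, y) = r(d_j, y) + \mathrm{Val}^{\pi_i}(y)$$

$$\mathrm{Appeal}^{\pi_i}(d_j, x) = r(d_j, x) + \mathrm{Val}^{\pi_i}(x)$$

Observe that for all $j$ we have $r(d_j, y) = r(d_j, x)$. Therefore we can conclude that
when $\mathrm{Val}^{\pi_i}(x) < \mathrm{Val}^{\pi_i}(y)$ we have $\mathrm{Appeal}^{\pi_i}(d_j, x) < \mathrm{Appeal}^{\pi_i}(d_j, y)$, and when
$\mathrm{Val}^{\pi_i}(x) > \mathrm{Val}^{\pi_i}(y)$ we have $\mathrm{Appeal}^{\pi_i}(d_j, x) > \mathrm{Appeal}^{\pi_i}(d_j, y)$. □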

Proof (of Proposition 6). To prove this claim it is sufficient to show that policy
iteration moves from the policy $\pi_i$ to the policy $\pi_{i+1}$. We break this task into
two parts: we will first show that $(d_k, d_{k-1})$ is the most appealing action at
every state $d_k$ with $1 \le k \le i + 1$, and then we will show that $(d_k, y)$ is the most
appealing action at every other state. In both cases, however, Assumption 5
combined with Proposition 21 implies that $\mathrm{Appeal}^{\pi_i}(d_k, x) < \mathrm{Appeal}^{\pi_i}(d_k, y)$.
Therefore, the action $(d_k, x)$ can be ignored.

For a state $d_k$ with $1 \le k \le i + 1$, Proposition 19 combined with the fact
that $k \le 2n$ gives:

$$\mathrm{Appeal}^{\pi_i}(d_k, d_{k-1}) = \mathrm{Val}^{\pi_i}(y) + 4n - k + 1$$

$$\ge \mathrm{Val}^{\pi_i}(y) + 2n + 1$$

$$> \mathrm{Val}^{\pi_i}(y) = \mathrm{Appeal}^{\pi_i}(d_k, y)$$

Therefore $(d_k, d_{k-1})$ is the most appealing action at these states.

We now consider the other states. The state $d_0$ only has the actions $(d_0, y)$
and $(d_0, x)$, and so this state can be ignored. The remaining states are those
states $d_k$ with $k$ in the range $i + 1 < k \le 2n$, and these states have an additional
action $(d_k, d_{k-1})$. Proposition 20 gives:

$$\mathrm{Appeal}^{\pi_i}(d_k, d_{k-1}) = \mathrm{Val}^{\pi_i}(y) - 1 < \mathrm{Val}^{\pi_i}(y) = \mathrm{Appeal}^{\pi_i}(d_k, y).$$

Therefore $(d_k, y)$ is the most appealing action at these states.



□ 



C Proof of Proposition [8] 

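Proposition 22. If $\pi$ is either $\pi^o_j$ or $\pi^c_j$, then when $1 \le k \le j$ we have
$\mathrm{Appeal}^{\pi}(b_i, d_k) = \mathrm{Val}^{\pi}(y) + 4n + k + 1$.

Proof. Since $\pi(d_k) = \pi_j(d_k)$ for every state $d_k$, we can apply Proposition 19 to
give:

$$\mathrm{Appeal}^{\pi}(b_i, d_k) = r(b_i, d_k) + \mathrm{Val}^{\pi}(d_k)$$

$$= 2k + (\mathrm{Val}^{\pi}(y) + 4n - k + 1)$$

$$= \mathrm{Val}^{\pi}(y) + 4n + k + 1$$

□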

Proposition 23. If $\pi$ is either $\pi^o_j$ or $\pi^c_j$, then when $j < k \le 2i$ we have
$\mathrm{Appeal}^{\pi}(b_i, d_k) = \mathrm{Val}^{\pi}(y) + 2k$.

Proof. Using the definition of appeal, and the fact that $\pi(d_k) = y$ when $j < k \le 2n$
gives:

$$\mathrm{Appeal}^{\pi}(b_i, d_k) = r(b_i, d_k) + \mathrm{Val}^{\pi}(d_k)$$

$$= r(b_i, d_k) + r(d_k, y) + \mathrm{Val}^{\pi}(y)$$

$$= 2k + \mathrm{Val}^{\pi}(y)$$

□

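Proposition 24. Suppose that Assumptions \^ and [5] hold. When policy iteration
is applied to $\pi_0^{o}$ it will move to the policy $\pi_1^{o}$.

Proof. To prove this proposition we must show that $(b_i, d_{2i})$ is the most appealing
action at the state $b_i$ in the policy $\pi_0^{o}$. For the action $(b_i, x)$, Assumption [5]
implies:

\[
\mathrm{Appeal}^{\pi_0^{o}}(b_i, x) = \mathrm{Val}^{\pi_0^{o}}(x) < \mathrm{Val}^{\pi_0^{o}}(y) = \mathrm{Appeal}^{\pi_0^{o}}(b_i, y)
\]

For the action $(b_i, y)$, Proposition [23] combined with the fact that $4i > 1$ gives:

\[
\mathrm{Appeal}^{\pi_0^{o}}(b_i, y) = \mathrm{Val}^{\pi_0^{o}}(y) + 1 < \mathrm{Val}^{\pi_0^{o}}(y) + 4i = \mathrm{Appeal}^{\pi_0^{o}}(b_i, d_{2i})
\]

Therefore, the action $(b_i, d_{2i})$ is more appealing than the actions $(b_i, x)$ and
$(b_i, y)$. For the actions $(b_i, d_k)$ with $k$ in the range $1 \le k < 2i$, Proposition [23]
and the fact that $k < 2i$ give:

\[
\mathrm{Appeal}^{\pi_0^{o}}(b_i, d_k) = \mathrm{Val}^{\pi_0^{o}}(y) + 2k < \mathrm{Val}^{\pi_0^{o}}(y) + 4i = \mathrm{Appeal}^{\pi_0^{o}}(b_i, d_{2i})
\]

Finally, for the action $(b_i, a_i)$, Proposition [2] combined with the fact that $i > 1$
gives:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_0^{o}}(b_i, a_i) &< \mathrm{Val}^{\pi_0^{o}}(b_i) + 1 \\
&= \mathrm{Val}^{\pi_0^{o}}(y) + 1 \\
&< \mathrm{Val}^{\pi_0^{o}}(y) + 4i = \mathrm{Appeal}^{\pi_0^{o}}(b_i, d_{2i})
\end{aligned}
\]

Therefore $(b_i, d_{2i})$ is the most appealing action at $b_i$ in the policy $\pi_0^{o}$. □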



Proof (of Proposition [8]). The fact that policy iteration moves from $\pi_0^{o}$ to $\pi_1^{o}$
was shown in Proposition [24]. To complete the proof we must show that policy
iteration moves from $\pi_j^{o}$ to $\pi_{j+1}^{o}$ for $j$ in the range $1 < j < 2i$. Proposition [6]
implies that this holds for the states $d_k$, which means that we are only concerned
with the action chosen at the state $b_i$. For this state we must show that $(b_i, d_j)$ is
the most appealing action at the state $b_i$ in the policy $\pi_j^{o}$. Assumption [7] implies
that the actions $(b_i, f_j)$ cannot be switched by policy iteration.

For the action $(b_i, x)$, Assumption [5] implies:

\[
\mathrm{Appeal}^{\pi_j^{o}}(b_i, x) = \mathrm{Val}^{\pi_j^{o}}(x) < \mathrm{Val}^{\pi_j^{o}}(y) = \mathrm{Appeal}^{\pi_j^{o}}(b_i, y).
\]

For the action $(b_i, y)$ the fact that $j > 0$ and Proposition [22] give:

\[
\mathrm{Appeal}^{\pi_j^{o}}(b_i, y) = \mathrm{Val}^{\pi_j^{o}}(y) + 1 < \mathrm{Val}^{\pi_j^{o}}(y) + 4n + j + 1 = \mathrm{Appeal}^{\pi_j^{o}}(b_i, d_j)
\]

Therefore the action $(b_i, d_j)$ is more appealing than the actions $(b_i, x)$ and $(b_i, y)$.

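For the actions of the form $(b_i, d_k)$ we consider two cases. For states $d_k$ with
$1 \le k < j$ we have by Proposition [22] and the fact that $k < j$:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_j^{o}}(b_i, d_k) &= \mathrm{Val}^{\pi_j^{o}}(y) + 4n + k + 1 \\
&< \mathrm{Val}^{\pi_j^{o}}(y) + 4n + j + 1 = \mathrm{Appeal}^{\pi_j^{o}}(b_i, d_j)
\end{aligned}
\]

For states $d_k$ with $j < k \le 2i$ we have by Proposition [23] and the fact that
$2k \le 4n$:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_j^{o}}(b_i, d_k) &= \mathrm{Val}^{\pi_j^{o}}(y) + 2k \\
&< \mathrm{Val}^{\pi_j^{o}}(y) + 4n + j + 1 = \mathrm{Appeal}^{\pi_j^{o}}(b_i, d_j)
\end{aligned}
\]

Therefore the action $(b_i, d_j)$ is more appealing than the actions $(b_i, d_k)$ for every
$k \ne j$.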

Finally, we consider the action $(b_i, a_i)$. Using Proposition [2] and Proposition [22]
gives:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_j^{o}}(b_i, a_i) &< \mathrm{Val}^{\pi_j^{o}}(b_i) + 1 = \mathrm{Appeal}^{\pi_j^{o}}(b_i, d_{j-1}) + 1 \\
&= \mathrm{Val}^{\pi_j^{o}}(y) + 4n + j + 1 = \mathrm{Appeal}^{\pi_j^{o}}(b_i, d_j)
\end{aligned}
\]

Therefore the action $(b_i, d_j)$ is more appealing than the action $(b_i, a_i)$. We have
now shown that the action $(b_i, d_j)$ is more appealing than every other action at
$b_i$ in the policy $\pi_j^{o}$. □
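The case analysis over the actions $(b_i, d_k)$ can be checked numerically. The following is a minimal sketch rather than part of the construction: the function names and the normalisation $\mathrm{Val}(y) = 0$ are assumptions made for illustration, and the two closed forms are the ones stated in Propositions 22 and 23.

```python
def appeal_b_to_d(n, j, k, val_y=0):
    """Appeal of the action (b_i, d_k) when the lane is set up to index j.

    Uses the closed form of Proposition 22 for 1 <= k <= j and the closed
    form of Proposition 23 for k > j. val_y is a hypothetical value of y.
    """
    if 1 <= k <= j:
        return val_y + 4 * n + k + 1
    return val_y + 2 * k


def most_appealing_lane_state(n, i, j, val_y=0):
    """Index k in 1..2i whose action (b_i, d_k) has the largest appeal."""
    return max(range(1, 2 * i + 1), key=lambda k: appeal_b_to_d(n, j, k, val_y))


# For every bit index i <= n and every lane position j, the best choice is d_j.
n = 5
for i in range(1, n + 1):
    for j in range(1, 2 * i + 1):
        assert most_appealing_lane_state(n, i, j) == j
```

The check relies only on $k \le 2i \le 2n$, which is the same fact the proof uses to bound $2k$ by $4n$.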

D Proof of Proposition [10]

Proof (of Proposition [10]). For the states $d_k$ this proposition follows from Proposition [51].
To complete the proof we must show that $(b_i, a_i)$ is the most appealing
action at $b_i$ in the policy $\pi_j^{c}$. Assumption [7] implies that the actions $(b_i, f_j)$
cannot be switched by policy iteration.



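For the action $(b_i, x)$, Assumption [5] implies:

\[
\mathrm{Appeal}^{\pi_j^{c}}(b_i, x) = \mathrm{Val}^{\pi_j^{c}}(x) < \mathrm{Val}^{\pi_j^{c}}(y) = \mathrm{Appeal}^{\pi_j^{c}}(b_i, y)
\]

Furthermore, for the action $(b_i, y)$, Assumption [9] implies:

\[
\mathrm{Appeal}^{\pi_j^{c}}(b_i, y) = \mathrm{Val}^{\pi_j^{c}}(y) + 1 < \mathrm{Val}^{\pi_j^{c}}(b_i) = \mathrm{Appeal}^{\pi_j^{c}}(b_i, a_i)
\]

Therefore, the action $(b_i, a_i)$ is more appealing than the actions $(b_i, x)$ and $(b_i, y)$.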

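For the actions $(b_i, d_k)$ we consider two cases. Firstly, when $1 \le k \le j$ we
can apply Proposition [22], the fact that $k \le 2n$, and Assumption [9] to give:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_j^{c}}(b_i, d_k) &\le \mathrm{Val}^{\pi_j^{c}}(y) + 4n + k + 1 \\
&\le \mathrm{Val}^{\pi_j^{c}}(y) + 6n + 1 < \mathrm{Val}^{\pi_j^{c}}(b_i) = \mathrm{Appeal}^{\pi_j^{c}}(b_i, a_i)
\end{aligned}
\]

When $j < k \le 2i$, Proposition [23], the fact that $k \le 2n$, and Assumption [9] give:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_j^{c}}(b_i, d_k) &\le \mathrm{Val}^{\pi_j^{c}}(y) + 2k \\
&\le \mathrm{Val}^{\pi_j^{c}}(y) + 4n < \mathrm{Val}^{\pi_j^{c}}(b_i) = \mathrm{Appeal}^{\pi_j^{c}}(b_i, a_i)
\end{aligned}
\]

Therefore, the action $(b_i, a_i)$ is the most appealing action at the state $b_i$ in the
policy $\pi_j^{c}$. □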

E Proof of Proposition [11]

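Proposition 25. Let $B$ be a configuration and $\pi$ be a member of Sequence($B$).
For every $i$ we have:

\[
\mathrm{Val}^{\pi}(c_i) =
\begin{cases}
\sum_{j \in B^{\ge i}} (10n + 4)(2^{j} - 2^{j-1}) & \text{if } i \in B, \\
\sum_{j \in B^{> i}} (10n + 4)(2^{j} - 2^{j-1}) - 1 & \text{otherwise.}
\end{cases}
\]

Proof. We first consider the case where $i \in B$. If $k = \min(B^{>i} \cup \{n + 1\})$ then the
definition of $\pi$ and Proposition [2] give:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(c_i) &= r(c_i, f_i) + r(f_i, b_i) + \mathrm{Val}^{\pi}(b_i) \\
&= r(c_i, f_i) + r(f_i, b_i) + r(g_i, r_i) + r(r_i, c_k) + \mathrm{Val}^{\pi}(c_k) \\
&= (4n + 1) + (-(10n + 4)2^{i-1} - 4n) + (10n + 4)2^{i} - 1 + \mathrm{Val}^{\pi}(c_k) \\
&= (10n + 4)(2^{i} - 2^{i-1}) + \mathrm{Val}^{\pi}(c_k)
\end{aligned}
\]

If $k = n + 1$ then we are done because $\mathrm{Val}^{\pi}(c_{n+1}) = 0$. Otherwise, repeated
substitution of the above expression for $\mathrm{Val}^{\pi}(c_k)$ gives:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(c_i) &= \sum_{j \in B^{\ge i}} (10n + 4)(2^{j} - 2^{j-1}) + \mathrm{Val}^{\pi}(c_{n+1}) \\
&= \sum_{j \in B^{\ge i}} (10n + 4)(2^{j} - 2^{j-1})
\end{aligned}
\]

We now consider the case where $i \notin B$. The definition of $\pi$ gives:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(c_i) &= r(c_i, r_i) + r(r_i, c_k) + \mathrm{Val}^{\pi}(c_k) \\
&= \mathrm{Val}^{\pi}(c_k) - 1 \\
&= \sum_{j \in B^{> i}} (10n + 4)(2^{j} - 2^{j-1}) - 1
\end{aligned}
\]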



Proposition 26. Let $B$ be a configuration and $\pi$ be a member of Sequence($B$).
For every $i \in B$ and $j \in B$ such that $j > i$, we have
$\mathrm{Val}^{\pi}(c_i) \le \mathrm{Val}^{\pi}(c_j) + (10n + 4)(2^{j-1} - 2^{i-1})$.

Proof. Let $C = B^{\ge i} \cap B^{< j}$ be the members of $B$ that lie between indices $i$ and
$j - 1$. Using Proposition [25] gives:

\[
\mathrm{Val}^{\pi}(c_i) = \sum_{k \in C} (10n + 4)(2^{k} - 2^{k-1}) + \mathrm{Val}^{\pi}(c_j)
\]

We use the fact that $(10n + 4)(2^{k} - 2^{k-1}) > 0$ for all $k$ and the fact that $i > 0$
to obtain:

\[
\begin{aligned}
\sum_{k \in C} (10n + 4)(2^{k} - 2^{k-1}) &\le \sum_{k=i}^{j-1} (10n + 4)(2^{k} - 2^{k-1}) \\
&= (10n + 4)(2^{j-1} - 2^{i-1})
\end{aligned}
\]

Therefore, we have:

\[
\mathrm{Val}^{\pi}(c_i) \le \mathrm{Val}^{\pi}(c_j) + (10n + 4)(2^{j-1} - 2^{i-1})
\]
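The telescoping step in this proof is easy to confirm by brute force. The sketch below is illustrative only: the helper names `term`, `full_sum` and `closed_form` are not from the paper, and the check covers a single small value of $n$.

```python
from itertools import combinations


def term(n, k):
    """One summand (10n + 4)(2^k - 2^(k-1)) for an index k >= 1."""
    return (10 * n + 4) * (2 ** k - 2 ** (k - 1))


def full_sum(n, i, j):
    """Sum of the summands over every index k with i <= k <= j - 1."""
    return sum(term(n, k) for k in range(i, j))


def closed_form(n, i, j):
    """Telescoped value (10n + 4)(2^(j-1) - 2^(i-1))."""
    return (10 * n + 4) * (2 ** (j - 1) - 2 ** (i - 1))


n = 4
for i in range(1, n + 1):
    for j in range(i + 1, n + 2):
        # The full sum telescopes exactly.
        assert full_sum(n, i, j) == closed_form(n, i, j)
        # Every subset C of {i, ..., j-1} sums to at most the closed form,
        # because each summand is positive.
        indices = range(i, j)
        for size in range(len(indices) + 1):
            for C in combinations(indices, size):
                assert sum(term(n, k) for k in C) <= closed_form(n, i, j)
```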



Proposition 27. Let $B$ be a configuration and $\pi$ be a member of Sequence($B$).
For every $i \in B$ and $j \in B$ such that $j > i$, we have $\mathrm{Val}^{\pi}(c_i) > \mathrm{Val}^{\pi}(c_j)$.

Proof. Let $C = B^{\ge i} \cap B^{< j}$ be the members of $B$ that lie between indices $i$ and
$j - 1$. Proposition [25] and the fact that $2^{k} - 2^{k-1}$ is positive for every $k$ imply:

\[
\mathrm{Val}^{\pi}(c_i) = \sum_{k \in C} (10n + 4)(2^{k} - 2^{k-1}) + \mathrm{Val}^{\pi}(c_j) > \mathrm{Val}^{\pi}(c_j)
\]



Proposition 28 (Proof of Assumption [3]). For every configuration $B$ and
every policy $\pi$ in Sequence($B$) we have $\mathrm{Val}^{\pi}(b_i) > 0$ and $\mathrm{Val}^{\pi}(g_i) < (10n + 4)2^{n}$,
for all $i$.

Proof. To prove that $\mathrm{Val}^{\pi}(b_i) > 0$ we must consider two cases. When $i \in B$ and
$k = \min(B^{>i} \cup \{n + 1\})$ then we can apply Proposition [2] to obtain:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(b_i) = \mathrm{Val}^{\pi}(g_i) &= \mathrm{Val}^{\pi}(r_i) + (10n + 4)2^{i} \\
&= \mathrm{Val}^{\pi}(c_k) + (10n + 4)2^{i} - 1
\end{aligned}
\]

Since $i > 0$ we have $(10n + 4)2^{i} - 1 > 0$, and we must therefore argue that
$\mathrm{Val}^{\pi}(c_k) \ge 0$. If $k = n + 1$ then we are done because $\mathrm{Val}^{\pi}(c_{n+1}) = 0$. Otherwise,
we can apply Proposition [25] to give:

\[
\mathrm{Val}^{\pi}(c_k) = \sum_{j \in B^{\ge k}} (10n + 4)(2^{j} - 2^{j-1})
\]

This summation is clearly positive, since $(10n + 4)(2^{j} - 2^{j-1}) > 0$ for every $j$.

We will now show that $\mathrm{Val}^{\pi}(b_i) > 0$ in the case where $i \notin B$. In this case we
have $\pi(b_i) = d_k$ for some $k$, and $\pi(d_l) = d_{l-1}$ for all $l$ in the range $1 \le l \le k$.
We can therefore apply Proposition [19] and Proposition [25] to give:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(b_i) &= 4n + k + 1 + \mathrm{Val}^{\pi}(y) \\
&= 4n + k + 1 + \sum_{j \in B} (10n + 4)(2^{j} - 2^{j-1})
\end{aligned}
\]

Since $k > 0$ we have that $(4n + k + 1) > 0$, and we have already argued that the
summation will be non-negative. This implies that the entire expression will be
positive. Therefore, we have shown that $\mathrm{Val}^{\pi}(b_i) > 0$ for all $i$.

Finally, we argue that $\mathrm{Val}^{\pi}(g_i) < (10n + 4)2^{n}$. If $k = \min(B^{>i} \cup \{n + 1\})$ then
we have:

\[
\mathrm{Val}^{\pi}(g_i) = (10n + 4)2^{i} + \mathrm{Val}^{\pi}(r_i) = (10n + 4)2^{i} - 1 + \mathrm{Val}^{\pi}(c_k)
\]

If $k = n + 1$ then we are done because $\mathrm{Val}^{\pi}(c_{n+1}) = 0$ and $(10n + 4)2^{i} - 1 <
(10n + 4)2^{n}$ for all $i \le n$. Otherwise, we can apply Proposition [26] and the fact
that $k - 1 \ge i$ to obtain:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(g_i) &\le (10n + 4)2^{i} - 1 + \mathrm{Val}^{\pi}(c_{n+1}) + (10n + 4)(2^{n} - 2^{k-1}) \\
&\le \mathrm{Val}^{\pi}(c_{n+1}) + (10n + 4)2^{n} - 1 \\
&< (10n + 4)2^{n}
\end{aligned}
\]

□ 

Proposition 29 (Proof of Assumption [5]). For every configuration $B$ and
every policy $\pi$ in Sequence($B$) we have $\mathrm{Val}^{\pi}(y) > \mathrm{Val}^{\pi}(x)$.

Proof. We first consider the case where $B = \emptyset$. In this situation the definition of
$\pi$ implies that $\pi(y) = c_{n+1}$ and $\pi(x) = c_{n+1}$. Therefore, we have:

\[
\mathrm{Val}^{\pi}(x) = -1 + \mathrm{Val}^{\pi}(c_{n+1}) < \mathrm{Val}^{\pi}(c_{n+1}) = \mathrm{Val}^{\pi}(y)
\]

We now consider the case where $B \ne \emptyset$. By definition of $\pi$ we have that there
is some index $i \in B$ such that $\pi(y) = c_i$ and $\pi(x) = f_i$. Moreover, since $i \in B$
we have that $\pi(c_i) = f_i$. We therefore have two equalities:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(y) &= r(y, c_i) + r(c_i, f_i) + \mathrm{Val}^{\pi}(f_i) = \mathrm{Val}^{\pi}(f_i) + 4n + 1 \\
\mathrm{Val}^{\pi}(x) &= r(x, f_i) + \mathrm{Val}^{\pi}(f_i) = \mathrm{Val}^{\pi}(f_i)
\end{aligned}
\]

Clearly, since $4n + 1 > 0$ we have $\mathrm{Val}^{\pi}(y) > \mathrm{Val}^{\pi}(x)$. □

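Proposition 30 (Proof of Assumption [7]). Let $B$ be a configuration and
$\pi$ be a member of Sequence($B$). If $(b_i, f_j)$ is an action such that $j \in B$ then
$\mathrm{Appeal}^{\pi}(b_i, f_j) < \mathrm{Val}^{\pi}(b_i)$.

Proof. To prove this proposition we must consider four cases. Firstly, when $j \in B$
and $i \notin B$ we can apply Proposition [22], the fact that $k > 0$, the fact that
$\min(B) \le j$, and Proposition [27] to give:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(b_i) = \mathrm{Val}^{\pi}(y) + 4n + k + 1 &> \mathrm{Val}^{\pi}(c_{\min(B)}) + 4n + 1 \\
&\ge \mathrm{Val}^{\pi}(c_j) + 4n + 1 \\
&= \mathrm{Val}^{\pi}(f_j) + 8n + 2 \\
&> \mathrm{Val}^{\pi}(f_j) + 4n + 1 = \mathrm{Appeal}^{\pi}(b_i, f_j)
\end{aligned}
\]

Secondly, we consider the case where $j \in B$ and $i \in B$. In this case we can
apply Proposition [2], the fact that $\min(B^{>i}) \le j$, Proposition [27], and the fact
that $i > 0$ to obtain:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(b_i) &= (10n + 4)2^{i} + \mathrm{Val}^{\pi}(r_i) \\
&= (10n + 4)2^{i} - 1 + \mathrm{Val}^{\pi}(c_{\min(B^{>i})}) \\
&\ge (10n + 4)2^{i} - 1 + \mathrm{Val}^{\pi}(c_j) \\
&= (10n + 4)2^{i} + 4n + \mathrm{Val}^{\pi}(f_j) \\
&> \mathrm{Val}^{\pi}(f_j) + 4n + 1 = \mathrm{Appeal}^{\pi}(b_i, f_j)
\end{aligned}
\]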

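Thirdly, we consider the case where $j \notin B$ and $i \notin B$. In this case, the fact
that $\pi(b_i) = \pi(b_j)$ gives:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi}(b_i, f_j) &= 4n + 1 + \mathrm{Val}^{\pi}(f_j) \\
&= -(10n + 4)2^{j-1} + 1 + \mathrm{Val}^{\pi}(b_j) \\
&= -(10n + 4)2^{j-1} + 1 + \mathrm{Val}^{\pi}(\pi(b_j)) \\
&= -(10n + 4)2^{j-1} + 1 + \mathrm{Val}^{\pi}(\pi(b_i)) < \mathrm{Val}^{\pi}(\pi(b_i))
\end{aligned}
\]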

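Finally, we consider the case where $j \notin B$ and $i \in B$. Proposition [22] and the
fact that $k \le 2n$ imply:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi}(b_i, f_j) &= -(10n + 4)2^{j-1} + 1 + \mathrm{Val}^{\pi}(b_j) \\
&= -(10n + 4)2^{j-1} + 4n + k + 2 + \mathrm{Val}^{\pi}(y) \\
&\le -(10n + 4)2^{j-1} + 6n + 2 + \mathrm{Val}^{\pi}(c_{\min(B)})
\end{aligned}
\]

Let $l = \min(B^{>j} \cup \{n + 1\})$ be the smallest bit in the configuration that is larger
than $j$. By Proposition [25] and the fact that there is no bit in the configuration
with an index $m$ in the range $j \le m < l$, we have:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(c_{\min(B)}) &= \sum_{m \in B^{< j}} (10n + 4)(2^{m} - 2^{m-1}) + \mathrm{Val}^{\pi}(c_l) \\
&\le \mathrm{Val}^{\pi}(c_l) + (10n + 4)(2^{j-1} - 2^{0})
\end{aligned}
\]

Therefore, we have:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi}(b_i, f_j) &\le \mathrm{Val}^{\pi}(c_l) + (10n + 4)(2^{j-1} - 1 - 2^{j-1}) + 6n + 2 \\
&= \mathrm{Val}^{\pi}(c_l) - 4n - 2 \\
&< \mathrm{Val}^{\pi}(c_l)
\end{aligned}
\]

However, Proposition [2], Proposition [27] and the fact that $i > 0$ imply:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(b_i) = (10n + 4)2^{i} + \mathrm{Val}^{\pi}(r_i) &= (10n + 4)2^{i} - 1 + \mathrm{Val}^{\pi}(c_{\min(B^{>i})}) \\
&\ge (10n + 4)2^{i} - 1 + \mathrm{Val}^{\pi}(c_l) \\
&> \mathrm{Val}^{\pi}(c_l)
\end{aligned}
\]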

□ 

Proposition 31 (Proof of Assumption [9]). For every configuration $B$ and
every policy $\pi$ in Sequence($B$) we have $\mathrm{Val}^{\pi}(y) + 6n + 1 < \mathrm{Val}^{\pi}(b_i)$, for every
index $i \in B$.

Proof. If $B = \emptyset$ then there are no indices $i \in B$, and so the proposition is vacuously
true. Otherwise, the definition of $\pi$ implies that $\pi(y) = c_{\min(B)}$. Applying
Proposition [26] gives:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(y) &= \mathrm{Val}^{\pi}(c_{\min(B)}) \\
&\le \mathrm{Val}^{\pi}(c_i) + (10n + 4)(2^{i-1} - 2^{\min(B)-1})
\end{aligned}
\]

Since $i \in B$ we have $\pi(c_i) = f_i$. Therefore we can apply the optimality equation,
Proposition [25], and the fact that $\min(B) - 1 \ge 0$ to obtain:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(y) &\le \mathrm{Val}^{\pi}(c_i) + (10n + 4)(2^{i-1} - 2^{\min(B)-1}) \\
&\le \mathrm{Val}^{\pi}(c_i) + (10n + 4)(2^{i-1} - 2^{0}) \\
&= \mathrm{Val}^{\pi}(f_i) + (4n + 1) + (10n + 4)(2^{i-1} - 1) \\
&= \mathrm{Val}^{\pi}(b_i) - (10n + 4)2^{i-1} - 4n + (4n + 1) + (10n + 4)(2^{i-1} - 1) \\
&= \mathrm{Val}^{\pi}(b_i) - (10n + 3)
\end{aligned}
\]

It is now clear that $\mathrm{Val}^{\pi}(y) + 6n + 1 < \mathrm{Val}^{\pi}(b_i)$.



□ 
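The arithmetic in the last chain of the proof of Proposition 31 can be replayed mechanically. The sketch below is only a sanity check under stated assumptions: the function name is hypothetical, `val_b` stands for an arbitrary value of $\mathrm{Val}^{\pi}(b_i)$, and the reward values are the ones that appear in the derivation above.

```python
def upper_bound_on_val_y(n, i, val_b):
    """Replay the chain Val(b_i) -> Val(f_i) -> Val(c_i) -> bound on Val(y)."""
    # Val(f_i) = Val(b_i) - (10n + 4) 2^(i-1) - 4n
    val_f = val_b - (10 * n + 4) * 2 ** (i - 1) - 4 * n
    # Val(c_i) = Val(f_i) + (4n + 1), because pi(c_i) = f_i when i is in B
    val_c = val_f + (4 * n + 1)
    # Val(y) <= Val(c_i) + (10n + 4)(2^(i-1) - 2^0)
    return val_c + (10 * n + 4) * (2 ** (i - 1) - 1)


for n in range(1, 8):
    for i in range(1, n + 1):
        for val_b in (0, 17, 1000):
            bound = upper_bound_on_val_y(n, i, val_b)
            # The chain collapses to Val(b_i) - (10n + 3) ...
            assert bound == val_b - (10 * n + 3)
            # ... which leaves room for the extra 6n + 1.
            assert bound + 6 * n + 1 < val_b
```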



F Proof of Proposition [12]



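Proposition 32. Let $B$ be a configuration and $\pi$ be a member of Sequence($B$).
If there are two indices $i \in B \cup \{n + 1\}$ and $j \in B \cup \{n + 1\}$ such that $i < j$,
then $\mathrm{Val}^{\pi}(c_i) > \mathrm{Val}^{\pi}(c_j)$.

Proof. We first prove the statement for the states $c_i$ and $c_j$. By Proposition [25]
we have:

\[
\mathrm{Val}^{\pi}(c_k) = \sum_{l \in B^{\ge k}} (10n + 4)(2^{l} - 2^{l-1}).
\]

Since $(10n + 4)(2^{l} - 2^{l-1}) > 0$ we have:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(c_j) &= \sum_{l \in B^{\ge j}} (10n + 4)(2^{l} - 2^{l-1}) \\
&< (10n + 4)(2^{i} - 2^{i-1}) + \sum_{l \in B^{\ge j}} (10n + 4)(2^{l} - 2^{l-1}) \\
&\le \sum_{l \in B^{\ge i}} (10n + 4)(2^{l} - 2^{l-1}) = \mathrm{Val}^{\pi}(c_i)
\end{aligned}
\]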

□ 

Proposition 33. Let $B$ be a configuration and $\pi$ be a member of Sequence($B$).
For every subset $C \subseteq B \cup \{n + 1\}$, if $i = \min(C)$ then we have $\mathrm{Val}^{\pi}(c_i) > \mathrm{Val}^{\pi}(c_j)$
for every $j \ne i$.

Proof. Proposition [32] implies that the state $c_i$ has a higher value than every
other state $c_j$ with $j \in C$. To complete the proof we must eliminate the states
$c_j$ with $j \notin C$. We will accomplish this by arguing that for every such state
$c_j$ there is some index $k \in C$ such that $\mathrm{Val}^{\pi}(c_k) > \mathrm{Val}^{\pi}(c_j)$. We choose
$k = \min(C^{>j} \cup \{n + 1\})$ to be the smallest index in $C$ that is higher than $j$, or the
index of the sink if $j$ is the largest index in $C$. Since $j \notin C$ we have:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(c_j) &= r(c_j, r_j) + r(r_j, c_k) + \mathrm{Val}^{\pi}(c_k) \\
&= \mathrm{Val}^{\pi}(c_k) - 1 < \mathrm{Val}^{\pi}(c_k)
\end{aligned}
\]

□ 

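Proposition 34. Let $B$ be a configuration and $\pi$ be a member of Sequence($B$).
If $i \notin B$ we have $\mathrm{Val}^{\pi}(f_i) + 4n + 1 < \mathrm{Val}^{\pi}(c_k)$ where $k = \min(B^{>i} \cup \{n + 1\})$.

Proof. Using Proposition [22] and the fact that $i \le n$ to obtain the value of $b_i$
gives:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(f_i) &\le -(10n + 4)2^{i-1} - 4n + 6n + 1 + \mathrm{Val}^{\pi}(y) \\
&= -(10n + 4)2^{i-1} + 2n + 1 + \mathrm{Val}^{\pi}(y)
\end{aligned}
\]

Using Proposition [25] to obtain the value of $y$ in terms of the state $c_k$ gives:

\[
\begin{aligned}
\mathrm{Val}^{\pi}(f_i) &\le -(10n + 4)2^{i-1} + 2n + 1 + \sum_{j \in B^{< i}} (10n + 4)(2^{j} - 2^{j-1}) + \mathrm{Val}^{\pi}(c_k) \\
&\le -(10n + 4)2^{i-1} + 2n + 1 + \sum_{j=1}^{i-1} (10n + 4)(2^{j} - 2^{j-1}) + \mathrm{Val}^{\pi}(c_k) \\
&= -(10n + 4)2^{i-1} + 2n + 1 + (10n + 4)(2^{i-1} - 2^{0}) + \mathrm{Val}^{\pi}(c_k) \\
&= 2n + 1 - (10n + 4) + \mathrm{Val}^{\pi}(c_k) \\
&= -8n - 3 + \mathrm{Val}^{\pi}(c_k)
\end{aligned}
\]

Therefore $\mathrm{Val}^{\pi}(f_i) + 4n + 1 < \mathrm{Val}^{\pi}(c_k)$. □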
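The simplification to $-8n - 3$ in the proof of Proposition 34 can be confirmed with a few lines of integer arithmetic. This is a sketch for checking the algebra only: the function name is hypothetical and `val_c` is an arbitrary stand-in for $\mathrm{Val}^{\pi}(c_k)$.

```python
def bound_on_val_f(n, i, val_c):
    """Upper bound on Val(f_i) assembled from the terms used in the proof."""
    lane_bound = 6 * n + 1                            # bound on Val(b_i) - Val(y)
    telescoped = (10 * n + 4) * (2 ** (i - 1) - 1)    # sum over j = 1 .. i-1
    return -(10 * n + 4) * 2 ** (i - 1) - 4 * n + lane_bound + telescoped + val_c


for n in range(1, 10):
    for i in range(1, n + 1):
        for val_c in (0, 5, 12345):
            bound = bound_on_val_f(n, i, val_c)
            assert bound == val_c - 8 * n - 3
            # Adding the reward 4n + 1 still stays below Val(c_k).
            assert bound + 4 * n + 1 < val_c
```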

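Proposition 35. Let $B$ be a configuration and $\pi$ be a member of Sequence($B$).
If $i = \min(B \cup \{n + 1\})$ then we have $\mathrm{Val}^{\pi}(f_i) > \mathrm{Val}^{\pi}(f_j)$ for every $j \ne i$.

Proof. We begin by arguing that $\mathrm{Val}^{\pi}(f_i) > \mathrm{Val}^{\pi}(f_j)$ for the case where $j \in B$.
Since $\mathrm{Val}^{\pi}(c_k) = \mathrm{Val}^{\pi}(f_k) + 4n + 1$ for every $k \in B$, we can apply Proposition [27]
to obtain:

\[
\mathrm{Val}^{\pi}(f_i) = \mathrm{Val}^{\pi}(c_i) - 4n - 1 > \mathrm{Val}^{\pi}(c_j) - 4n - 1 = \mathrm{Val}^{\pi}(f_j)
\]

We now complete the proof by arguing that for every state $f_j$ with $j \notin B$ there is
some state $f_k$ with $k \in B$ such that $\mathrm{Val}^{\pi}(f_k) > \mathrm{Val}^{\pi}(f_j)$. Proposition [34] implies
that $\mathrm{Val}^{\pi}(f_j) + 4n + 1 < \mathrm{Val}^{\pi}(c_k)$ where $k = \min(B^{>j} \cup \{n + 1\})$. Therefore:

\[
\mathrm{Val}^{\pi}(f_j) < \mathrm{Val}^{\pi}(c_k) - (4n + 1) = \mathrm{Val}^{\pi}(f_k)
\]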

□ 

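Proposition 36. Let $B$ be a configuration and $\pi$ be a member of Sequence($B$).
The state $c_i$ will not be switched away from $\pi(c_i)$.

Proof. First we will consider the case where $i \in B$, where we must show that
the state $c_i$ does not switch away from the action $(c_i, f_i)$. In this case, the fact that
$2^{i} - 2^{i-1} > 0$ implies:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi}(c_i, f_i) &= (10n + 4)(2^{i} - 2^{i-1}) + 1 + \mathrm{Val}^{\pi}(r_i) \\
&> \mathrm{Val}^{\pi}(r_i) = \mathrm{Appeal}^{\pi}(c_i, r_i)
\end{aligned}
\]

Therefore, policy iteration will not switch away from the action $(c_i, f_i)$.

Now we consider the case where $i \notin B$. In this case the bound
$\mathrm{Val}^{\pi}(f_i) \le \mathrm{Val}^{\pi}(c_k) - 8n - 3$ derived in the proof of Proposition [34] implies
that if $k = \min(B^{>i} \cup \{n + 1\})$ then:

\[
\mathrm{Appeal}^{\pi}(c_i, f_i) = 4n + 1 + \mathrm{Val}^{\pi}(f_i) \le \mathrm{Val}^{\pi}(c_k) - 4n - 2 < \mathrm{Val}^{\pi}(c_k) - 1
\]

We also have that $\mathrm{Val}^{\pi}(c_i) = \mathrm{Val}^{\pi}(c_k) - 1$. Therefore, policy iteration will not
switch away from the current action at $c_i$. □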



Proof (of Proposition [12]). Our previous proofs have shown that this proposition
is true for every state not contained in the set $\{c_i, f_i, g_i, r_i : 1 \le i \le n\} \cup \{x, y\}$,
which this proof will deal with. Note that since the states $f_i$ and $g_i$ have only
one successor they can be ignored. The rest of this proof is dedicated to showing
that the states $c_i$, $r_i$, $x$, and $y$ do not switch away from their current action for
every policy $\pi$ in Sequence($B$). For the state $c_i$ this proposition is a consequence
of Proposition [36].

For the state $r_i$, we must show that the most appealing action is $(r_i, c_j)$ where
$j = \min(B^{>i} \cup \{n + 1\})$. When $C = B^{>i} \cup \{n + 1\}$, Proposition [33] implies that
$\mathrm{Val}^{\pi}(c_j) > \mathrm{Val}^{\pi}(c_k)$ for every $k \ne j$. Since every outgoing action from $r_i$ has
the same reward this also implies that the action $(r_i, c_j)$ is the most appealing
action at $r_i$.

For the state $x$, we must show that the most appealing action is $(x, f_k)$, where
$k = \min(B \cup \{n + 1\})$. Proposition [35] implies that $\mathrm{Val}^{\pi}(f_k) > \mathrm{Val}^{\pi}(f_j)$ for every
$j \ne k$. Since every outgoing action from $x$ has the same reward this also implies
that the action $(x, f_k)$ is the most appealing action at $x$.

For the state $y$, we must show that the most appealing action is $(y, c_k)$,
where $k = \min(B \cup \{n + 1\})$. When $C = B \cup \{n + 1\}$, Proposition [33] implies that
$\mathrm{Val}^{\pi}(c_k) > \mathrm{Val}^{\pi}(c_j)$ for every $j \ne k$. Since every outgoing action from $y$ has the
same reward this also implies that the action $(y, c_k)$ is the most appealing action
at $y$. □

G Proof of Proposition [13]

Proof. For every state other than $b_i$ the proof is identical to the proof that policy
iteration moves from the policy $\pi_j^{B}$ to the policy $\pi_{j+1}^{B}$ for $j < 2i$. To complete
the proof we must argue that the action $(b_i, a_i)$ is the most appealing action at
the state $b_i$. We will first prove that every action other than $(b_i, a_i)$ cannot be
switched by policy iteration at the state $b_i$. For the actions $(b_i, d_k)$ we have by
Proposition [22]:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_{2i+1}^{B}}(b_i, d_k) &= \mathrm{Val}^{\pi_{2i+1}^{B}}(y) + 4n + k + 1 \\
&\le \mathrm{Val}^{\pi_{2i+1}^{B}}(y) + 4n + 2i + 1 \\
&= \mathrm{Appeal}^{\pi_{2i+1}^{B}}(b_i, d_{2i}) = \mathrm{Val}^{\pi_{2i+1}^{B}}(b_i)
\end{aligned}
\]

This implies that no action of the form $(b_i, d_k)$ can be switched by policy iteration.
Similarly, for the action $(b_i, y)$ we have:

\[
\mathrm{Appeal}^{\pi_{2i+1}^{B}}(b_i, y) = \mathrm{Val}^{\pi_{2i+1}^{B}}(y) + 1 < \mathrm{Val}^{\pi_{2i+1}^{B}}(y) + 4n + 2i + 1 = \mathrm{Val}^{\pi_{2i+1}^{B}}(b_i)
\]

For the action $(b_i, x)$, Proposition [29] gives:

\[
\mathrm{Appeal}^{\pi_{2i+1}^{B}}(b_i, x) = \mathrm{Val}^{\pi_{2i+1}^{B}}(x) < \mathrm{Val}^{\pi_{2i+1}^{B}}(y) < \mathrm{Val}^{\pi_{2i+1}^{B}}(b_i)
\]

Proposition [30] implies that the actions $(b_i, f_j)$ cannot be switched by policy
iteration.



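Now we must prove that the action $(b_i, a_i)$ can be switched by policy iteration.
We begin by showing that $\mathrm{Val}^{\pi_{2i+1}^{B}}(g_i) > \mathrm{Val}^{\pi_{2i+1}^{B}}(b_i)$. Let $k = \min(B^{>i} \cup \{n + 1\})$
be the smallest index in $B$ that is bigger than $i$, or the index of the sink if $i$ is
the highest bit. Using Proposition [22], the fact that $i \le n$, and Proposition [25]
gives:

\[
\begin{aligned}
\mathrm{Val}^{\pi_{2i+1}^{B}}(b_i) &\le 6n + 1 + \mathrm{Val}^{\pi_{2i+1}^{B}}(y) \\
&= 6n + 1 + \sum_{j \in B^{< i}} (10n + 4)(2^{j} - 2^{j-1}) + \mathrm{Val}^{\pi_{2i+1}^{B}}(c_k) \\
&\le 6n + 1 + \sum_{j=1}^{i-1} (10n + 4)(2^{j} - 2^{j-1}) + \mathrm{Val}^{\pi_{2i+1}^{B}}(c_k) \\
&= 6n + 1 + (10n + 4)(2^{i-1} - 2^{0}) + \mathrm{Val}^{\pi_{2i+1}^{B}}(c_k) \\
&= (10n + 4)2^{i-1} - 4n - 3 + \mathrm{Val}^{\pi_{2i+1}^{B}}(c_k)
\end{aligned}
\]

The value of the state $g_i$ is:

\[
\mathrm{Val}^{\pi_{2i+1}^{B}}(g_i) = (10n + 4)2^{i} + \mathrm{Val}^{\pi_{2i+1}^{B}}(r_i) = (10n + 4)2^{i} - 1 + \mathrm{Val}^{\pi_{2i+1}^{B}}(c_k)
\]

Since $(10n + 4)2^{i} - 1 > (10n + 4)2^{i-1} - 4n - 3$ for every $i$, we have that
$\mathrm{Val}^{\pi_{2i+1}^{B}}(g_i) > \mathrm{Val}^{\pi_{2i+1}^{B}}(b_i)$. Now we can conclude:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_{2i+1}^{B}}(b_i, a_i) &= \Big(1 - \frac{2^{-n}}{10n + 4}\Big) \mathrm{Val}^{\pi_{2i+1}^{B}}(b_i) + \frac{2^{-n}}{10n + 4} \mathrm{Val}^{\pi_{2i+1}^{B}}(g_i) \\
&> \Big(1 - \frac{2^{-n}}{10n + 4}\Big) \mathrm{Val}^{\pi_{2i+1}^{B}}(b_i) + \frac{2^{-n}}{10n + 4} \mathrm{Val}^{\pi_{2i+1}^{B}}(b_i) \\
&= \mathrm{Val}^{\pi_{2i+1}^{B}}(b_i)
\end{aligned}
\]

Therefore, the action $(b_i, a_i)$ will be switched by policy iteration. □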
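The final step uses only the fact that the appeal of $(b_i, a_i)$ is a convex combination of the values of $b_i$ and $g_i$. The sketch below illustrates this with exact rational arithmetic. It assumes that the action moves to $g_i$ with probability $2^{-n}/(10n + 4)$ and otherwise returns to $b_i$ with no reward, which is how the displayed expression reads; the function name and the sample values are hypothetical.

```python
from fractions import Fraction


def appeal_of_a(n, val_b, val_g):
    """Appeal of (b_i, a_i) as a convex combination of Val(b_i) and Val(g_i).

    Assumes transition probability 2^-n / (10n + 4) to g_i and the
    remaining probability back to b_i, with zero reward.
    """
    p = Fraction(1, (10 * n + 4) * 2 ** n)
    return (1 - p) * val_b + p * val_g


n = 3
cap = (10 * n + 4) * 2 ** n  # the bound (10n + 4) 2^n on Val(g_i)
for val_b in (1, 40, 200):
    for val_g in (val_b + 1, cap - 1):
        appeal = appeal_of_a(n, val_b, val_g)
        # Switchable exactly because Val(g_i) > Val(b_i) ...
        assert appeal > val_b
        # ... yet the gain stays below 1 while 0 < Val(b_i) and Val(g_i) < cap.
        assert appeal < val_b + 1
```

The second assertion mirrors the bound $\mathrm{Appeal}(b_i, a_i) < \mathrm{Val}(b_i) + 1$ that the earlier proofs take from Proposition [2].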



H Proof of Proposition [14]

Proof. For the state $b_i$ the proof that $a_i$ is the most appealing action is identical
to the proof given for Proposition [13]. For every state other than $b_i$, $c_i$, $x$, or the
states $b_j$ with $j < i$, the proof that policy iteration moves from $\pi_{R1}^{B}$ to $\pi_{R2}^{B}$
is identical to the proof given for Proposition [12].

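For the state $c_i$ we must show that the action $(c_i, f_i)$ is the most appealing
action. Using Proposition [2] and the fact that $2^{i} - 2^{i-1} > 0$ for every $i$ gives:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_{R1}^{B}}(c_i, f_i) &= (10n + 4)(2^{i} - 2^{i-1}) + 1 + \mathrm{Val}^{\pi_{R1}^{B}}(r_i) \\
&> \mathrm{Val}^{\pi_{R1}^{B}}(r_i) = \mathrm{Appeal}^{\pi_{R1}^{B}}(c_i, r_i)
\end{aligned}
\]

Therefore $(c_i, f_i)$ is the most appealing action at the state $c_i$.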



For the state $x$ we must show that the action $(x, f_i)$ is the most appealing
action. Note that $\pi_{R1}^{B} = \pi_{2i+2}^{B}$ for every state except the state $b_i$. A proof that
is identical to the one given for Proposition [35] can be used to conclude that if
$k = \min(B \cup \{n + 1\})$ then $\mathrm{Val}^{\pi_{R1}^{B}}(f_k) > \mathrm{Val}^{\pi_{R1}^{B}}(f_j)$ for every $j \ne i$. Since every
outgoing action from $x$ has the same reward it is therefore sufficient to argue
that $\mathrm{Val}^{\pi_{R1}^{B}}(f_i) > \mathrm{Val}^{\pi_{R1}^{B}}(f_k)$.

If $l = \min(B^{>i} \cup \{n + 1\})$, then Proposition [5] implies:

\[
\mathrm{Val}^{\pi_{R1}^{B}}(f_i) = (10n + 4)(2^{i} - 2^{i-1}) - 4n - 1 + \mathrm{Val}^{\pi_{R1}^{B}}(c_l)
\]

Moreover, we can express the value of $f_k$ as:

\[
\begin{aligned}
\mathrm{Val}^{\pi_{R1}^{B}}(f_k) &= \sum_{j \in B^{< i}} (10n + 4)(2^{j} - 2^{j-1}) - 4n - 1 + \mathrm{Val}^{\pi_{R1}^{B}}(c_l) \\
&\le \sum_{j=1}^{i-1} (10n + 4)(2^{j} - 2^{j-1}) - 4n - 1 + \mathrm{Val}^{\pi_{R1}^{B}}(c_l) \\
&= (10n + 4)(2^{i-1} - 2^{0}) - 4n - 1 + \mathrm{Val}^{\pi_{R1}^{B}}(c_l)
\end{aligned}
\]

Since $(10n + 4)(2^{i} - 2^{i-1}) > (10n + 4)(2^{i-1} - 2^{0})$ for every $i > 0$ we can conclude
that $(x, f_i)$ is the most appealing action at the state $x$.

For the states $b_j$ with $j < i$ we must show that the action $(b_j, f_i)$ is the most
appealing action. A proof that is identical to the proof given for Proposition [10]
can be used to show that the action $(b_j, x)$, the action $(b_j, y)$, and the actions
of the form $(b_j, d_k)$ will not be switched by policy iteration in the policy $\pi_{R1}^{B}$.
Moreover, a proof that is identical to the proof of Proposition [30] can be used
to show that the actions of the form $(b_j, f_k)$ with $k \ne i$ cannot be switched by
policy iteration in the policy $\pi_{R1}^{B}$. To complete the proof we must therefore argue
that $\mathrm{Appeal}^{\pi_{R1}^{B}}(b_j, f_i) > \mathrm{Val}^{\pi_{R1}^{B}}(b_j)$.

We have previously derived an expression for the value of $f_i$ in terms of the
state $c_l$, where $l = \min(B^{>i} \cup \{n + 1\})$. We can use this to obtain:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_{R1}^{B}}(b_j, f_i) &= (10n + 4)(2^{i} - 2^{i-1}) + \mathrm{Val}^{\pi_{R1}^{B}}(c_l) \\
&= (10n + 4)2^{i-1} + \mathrm{Val}^{\pi_{R1}^{B}}(c_l)
\end{aligned}
\]

We can also express the value of the state $b_j$ as:

\[
\begin{aligned}
\mathrm{Val}^{\pi_{R1}^{B}}(b_j) &= (10n + 4)2^{j} - 1 + \sum_{k \in B^{> j} \cap B^{< i}} (10n + 4)(2^{k} - 2^{k-1}) + \mathrm{Val}^{\pi_{R1}^{B}}(c_l) \\
&\le (10n + 4)2^{j} - 1 + \sum_{k=j+1}^{i-1} (10n + 4)(2^{k} - 2^{k-1}) + \mathrm{Val}^{\pi_{R1}^{B}}(c_l) \\
&= (10n + 4)2^{j} - 1 + (10n + 4)(2^{i-1} - 2^{j}) + \mathrm{Val}^{\pi_{R1}^{B}}(c_l) \\
&= (10n + 4)2^{i-1} - 1 + \mathrm{Val}^{\pi_{R1}^{B}}(c_l)
\end{aligned}
\]

Since $(10n + 4)2^{i-1} > (10n + 4)2^{i-1} - 1$ we have that $\mathrm{Appeal}^{\pi_{R1}^{B}}(b_j, f_i) >
\mathrm{Val}^{\pi_{R1}^{B}}(b_j)$. This implies that the action $(b_j, f_i)$ will be switched by policy
iteration at every state $b_j$ with $j < i$. □



I Proof of Proposition [15]

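Proposition 37. We have $\mathrm{Val}^{\pi_{R2}^{B}}(y) + 6n + 1 < \mathrm{Val}^{\pi_{R2}^{B}}(x)$.

Proof. Let $l = \min(B \cup \{n + 1\})$. We first consider the case where $l < i$. It is not
difficult to see that if $l < i$ then $l = 1$, since $i$ is the smallest index that is not
contained in $B$. In this case we can express the value of $y$ in terms of the value
of the state $f_i$ as:

\[
\begin{aligned}
\mathrm{Val}^{\pi_{R2}^{B}}(y) &= -(10n + 4)2^{l-1} + 4n + 2 + \mathrm{Val}^{\pi_{R2}^{B}}(f_i) \\
&= -(10n + 4) + 4n + 2 + \mathrm{Val}^{\pi_{R2}^{B}}(f_i) \\
&= -6n - 2 + \mathrm{Val}^{\pi_{R2}^{B}}(f_i)
\end{aligned}
\]

Moreover, we can express the value of $x$ as:

\[
\mathrm{Val}^{\pi_{R2}^{B}}(x) = \mathrm{Val}^{\pi_{R2}^{B}}(f_i)
\]

Therefore, we have $\mathrm{Val}^{\pi_{R2}^{B}}(y) + 6n + 1 < \mathrm{Val}^{\pi_{R2}^{B}}(x)$.

The second case that we must consider is when $l > i$, which occurs only when
$i = 1$. In this case we can express the value of $y$ in terms of the value of $c_l$ as:

\[
\mathrm{Val}^{\pi_{R2}^{B}}(y) = \mathrm{Val}^{\pi_{R2}^{B}}(c_l)
\]

Similarly, we can express the value of $x$ in terms of the value of $c_l$. Our derivation
uses the fact that $i = 1$.

\[
\begin{aligned}
\mathrm{Val}^{\pi_{R2}^{B}}(x) &= -(10n + 4)2^{i-1} - 4n + (10n + 4)2^{i} - 1 + \mathrm{Val}^{\pi_{R2}^{B}}(c_l) \\
&= (10n + 4) - 4n - 1 + \mathrm{Val}^{\pi_{R2}^{B}}(c_l) \\
&= 6n + 3 + \mathrm{Val}^{\pi_{R2}^{B}}(c_l)
\end{aligned}
\]

Once again it is clear that $\mathrm{Val}^{\pi_{R2}^{B}}(y) + 6n + 1 < \mathrm{Val}^{\pi_{R2}^{B}}(x)$. □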

Proof (of Proposition [15]). For the states in the set $\{c_j, r_j : j > i\} \cup \{b_j : j \in
B \text{ and } j > i\} \cup \{x\}$ the proof that policy iteration does not switch away from
the action chosen by $\pi_0^{B'}$ is identical to the proof given for Proposition [12]. For
the states $r_j$ with $j < i$, the proof that $(r_j, c_i)$ is the most appealing action at
$r_j$ is identical to the proof given for Proposition [12]. The proof that the states
$b_j$ with $j < i$ do not switch away from the action $(b_j, f_i)$ is very similar to the
proof given for Proposition [14] that $(b_j, f_i)$ was the most appealing action in the
policy $\pi_{R1}^{B}$.

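For the state $c_j$ with $j < i$ we must show that policy iteration does not switch
away from the action $(c_j, f_j)$. The first case that we consider is when there is
some other index $l \in B$ in the range $j < l < i$. In this case we have:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_{R2}^{B}}(c_j, r_j) &= -(10n + 4)2^{l-1} + 4n + 1 + \mathrm{Val}^{\pi_{R2}^{B}}(f_i) \\
&< -(10n + 4)2^{j-1} + 4n + 2 + \mathrm{Val}^{\pi_{R2}^{B}}(f_i) \\
&= \mathrm{Appeal}^{\pi_{R2}^{B}}(c_j, f_j)
\end{aligned}
\]

The other case that must be considered is when $j$ is the largest index in $B$ that
is smaller than $i$. If $l = \min(B^{>i} \cup \{n + 1\})$ then we have:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_{R2}^{B}}(c_j, r_j) &= \mathrm{Val}^{\pi_{R2}^{B}}(c_l) - 1 \\
&< \mathrm{Val}^{\pi_{R2}^{B}}(c_l) + (10n + 4)(2^{i} - 2^{i-1} - 2^{j-1}) + 4n + 1 \\
&= \mathrm{Appeal}^{\pi_{R2}^{B}}(c_j, f_j)
\end{aligned}
\]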

For the state $y$ we must show that the most appealing action is $(y, c_i)$. For the
actions of the form $(y, c_j)$ with $j > i$ we can use the same argument that was used
in the proof of Proposition [12] to argue that $\mathrm{Appeal}^{\pi_{R2}^{B}}(y, c_i) > \mathrm{Appeal}^{\pi_{R2}^{B}}(y, c_j)$.
For the actions $(y, c_j)$ with $j < i$ we have $\pi_{R2}^{B}(c_j) = f_j$ and $\pi_{R2}^{B}(b_j) = f_i$.
Therefore, we can express the appeal of $(y, c_j)$ as:

\[
\mathrm{Appeal}^{\pi_{R2}^{B}}(y, c_j) = -(10n + 4)2^{j-1} + 4n + 2 + \mathrm{Val}^{\pi_{R2}^{B}}(f_i)
\]

Since $\pi_{R2}^{B}(c_i) = f_i$ we can also express the appeal of the action $(y, c_i)$ as:

\[
\mathrm{Appeal}^{\pi_{R2}^{B}}(y, c_i) = 4n + 1 + \mathrm{Val}^{\pi_{R2}^{B}}(f_i)
\]

Since $-(10n + 4)2^{j-1} + 1 < 0$ for every $j > 0$ we have that the action $(y, c_i)$ is
the most appealing action at $y$.

For the states $d_k$ we must show that the most appealing action is $(d_k, x)$.
Proposition [37] and the fact that $r(d_k, x) = r(d_k, y)$ for every $k$ imply:

\[
\mathrm{Appeal}^{\pi_{R2}^{B}}(d_k, y) = r(d_k, y) + \mathrm{Val}^{\pi_{R2}^{B}}(y) < r(d_k, x) + \mathrm{Val}^{\pi_{R2}^{B}}(x) = \mathrm{Appeal}^{\pi_{R2}^{B}}(d_k, x)
\]

Every state $d_k$ with $k \ge 1$ has an additional action $(d_k, d_{k-1})$, for which we
consider two cases. When $1 \le k \le 2i + 4$, Proposition [19] and Proposition [37] give:

\[
\mathrm{Appeal}^{\pi_{R2}^{B}}(d_k, d_{k-1}) = 4n - k + 1 + \mathrm{Val}^{\pi_{R2}^{B}}(y) < \mathrm{Val}^{\pi_{R2}^{B}}(x) = \mathrm{Appeal}^{\pi_{R2}^{B}}(d_k, x)
\]

In the case where $k > 2i + 4$, Proposition !^ and Proposition [37] imply:

\[
\mathrm{Appeal}^{\pi_{R2}^{B}}(d_k, d_{k-1}) = \mathrm{Val}^{\pi_{R2}^{B}}(y) - 1 < \mathrm{Val}^{\pi_{R2}^{B}}(x) = \mathrm{Appeal}^{\pi_{R2}^{B}}(d_k, x)
\]

We have therefore shown that the action $(d_k, x)$ is the most appealing action at
the state $d_k$.

Finally, we must show that the action $(b_j, x)$ is the most appealing action at
every state $b_j$ with $j \notin B \setminus \{i\}$. The first case that we consider is when $j > i + 1$. In
this case a proof that is identical to the proof given for Proposition [5] can be used
to show that the action $(b_j, d_{2i+4})$ is more appealing than every action other than
$(b_j, x)$. We must therefore argue that $\mathrm{Appeal}^{\pi_{R2}^{B}}(b_j, x) > \mathrm{Appeal}^{\pi_{R2}^{B}}(b_j, d_{2i+4})$.
Using Proposition !^ and Proposition [37] gives:

\[
\mathrm{Appeal}^{\pi_{R2}^{B}}(b_j, d_{2i+4}) \le \mathrm{Val}^{\pi_{R2}^{B}}(y) + 6n + 1 < \mathrm{Val}^{\pi_{R2}^{B}}(x) = \mathrm{Appeal}^{\pi_{R2}^{B}}(b_j, x)
\]

For the case where $j = i + 1$ there is no action $(b_j, d_{2i+4})$. Once again, the
techniques used in the proof of Proposition [5] can be used to show that $(b_j, a_j)$
is more appealing than every action other than $(b_j, x)$. We must argue that
$\mathrm{Appeal}^{\pi_{R2}^{B}}(b_j, x) > \mathrm{Appeal}^{\pi_{R2}^{B}}(b_j, a_j)$. Using Proposition [2], Proposition [22] and
Proposition [37] gives:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_{R2}^{B}}(b_j, a_j) &< \mathrm{Val}^{\pi_{R2}^{B}}(b_j) + 1 \\
&\le \mathrm{Val}^{\pi_{R2}^{B}}(y) + 6n + 2 \\
&\le \mathrm{Val}^{\pi_{R2}^{B}}(x) = \mathrm{Appeal}^{\pi_{R2}^{B}}(b_j, x)
\end{aligned}
\]

Therefore, policy iteration will switch the action $(b_j, x)$ at every state $b_j$ with
$j \notin B$. □



J Proof of Proposition [16]

Proof. For the states in the set $\{c_j : j > i\} \cup \{b_j : j \in B \text{ and } j > i\} \cup \{r_j :
1 \le j \le n\} \cup \{x, y\}$ the proof that policy iteration does not switch away from
the action chosen by $\pi_0^{B'}$ is very similar to the proof given for Proposition [12].
The proof that the action $(c_j, r_j)$ is the most appealing action at the states $c_j$
with $j < i$ is very similar to the proof given for Proposition [36].

We must show that $(d_k, y)$ is the most appealing action at every state $d_k$.
For the action $(d_k, d_{k-1})$ we have:

\[
\mathrm{Appeal}^{\pi_{R3}^{B}}(d_k, d_{k-1}) = \mathrm{Val}^{\pi_{R3}^{B}}(x) - 1 < \mathrm{Val}^{\pi_{R3}^{B}}(x) = \mathrm{Appeal}^{\pi_{R3}^{B}}(d_k, x)
\]

For the action $(d_k, y)$, the fact that $r(d_k, y) = r(d_k, x)$ implies:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_{R3}^{B}}(d_k, x) &= r(d_k, x) + \mathrm{Val}^{\pi_{R3}^{B}}(f_i) \\
&< r(d_k, y) + 4n + 1 + \mathrm{Val}^{\pi_{R3}^{B}}(f_i) = \mathrm{Appeal}^{\pi_{R3}^{B}}(d_k, y)
\end{aligned}
\]

Therefore, the action $(d_k, y)$ is the most appealing action at the state $d_k$.

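For the states $b_j$ with $j \notin B'$ we must show that $(b_j, y)$ is the most appealing
action. We first consider the case where $j > i$. For the actions $(b_j, d_k)$ we have:

\[
\mathrm{Appeal}^{\pi_{R3}^{B}}(b_j, d_k) \le 4n + \mathrm{Val}^{\pi_{R3}^{B}}(f_i) < 4n + 2 + \mathrm{Val}^{\pi_{R3}^{B}}(f_i) = \mathrm{Appeal}^{\pi_{R3}^{B}}(b_j, y)
\]

For the action $(b_j, x)$ we have:

\[
\mathrm{Appeal}^{\pi_{R3}^{B}}(b_j, x) = \mathrm{Val}^{\pi_{R3}^{B}}(f_i) < \mathrm{Val}^{\pi_{R3}^{B}}(f_i) + 4n + 2 = \mathrm{Appeal}^{\pi_{R3}^{B}}(b_j, y)
\]

A proof that is similar to the proof of Proposition [30] can be used to show that
the actions of the form $(b_j, f_k)$ will not be switched by policy iteration. Finally,
for the action $(b_j, a_j)$ we have by Proposition [2]:

\[
\begin{aligned}
\mathrm{Appeal}^{\pi_{R3}^{B}}(b_j, a_j) &< \mathrm{Val}^{\pi_{R3}^{B}}(b_j) + 1 \\
&= \mathrm{Val}^{\pi_{R3}^{B}}(f_i) + 1 \\
&< \mathrm{Val}^{\pi_{R3}^{B}}(f_i) + 4n + 2 = \mathrm{Appeal}^{\pi_{R3}^{B}}(b_j, y)
\end{aligned}
\]

Therefore $(b_j, y)$ is the most appealing action at the states $b_j$ with $j \notin B'$ and
$j > i$.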

Finally, we consider the states $b_j$ where $j \notin B'$ and $j < i$. The proof that the
actions $(b_j, d_k)$ and the action $(b_j, x)$ are less appealing than the action $(b_j, y)$
is identical to the proof that was given for the states $b_j$ with $j \notin B'$ and $j > i$.
For the actions $(b_j, f_k)$ with $k > i$ a proof that is similar to the proof given for
Proposition [30] can be used to show that these actions will not be switched by
policy iteration. For the action $(b_j, f_i)$ we have:

\[
\mathrm{Appeal}^{\pi_{R3}^{B}}(b_j, f_i) = 4n + 1 + \mathrm{Val}^{\pi_{R3}^{B}}(f_i) < 4n + 2 + \mathrm{Val}^{\pi_{R3}^{B}}(f_i) = \mathrm{Appeal}^{\pi_{R3}^{B}}(b_j, y)
\]

For the action $(b_j, a_j)$ we apply Proposition [2] to give:

\[
\mathrm{Appeal}^{\pi_{R3}^{B}}(b_j, a_j) < \mathrm{Val}^{\pi_{R3}^{B}}(b_j) + 1 = \mathrm{Val}^{\pi_{R3}^{B}}(f_i) + 4n + 2 = \mathrm{Appeal}^{\pi_{R3}^{B}}(b_j, y)
\]

Therefore $(b_j, y)$ is the most appealing action at the states $b_j$ with $j \notin B'$ and
$j < i$. □



Proof (of Theorem [18]). It can easily be verified that, for every policy $\pi$ that the total
reward criterion policy iteration algorithm considers, we have, for every state $s$,



This implies that $G^{\pi}(s) = 0$ for every such policy. When $G^{\pi}(s) = 0$ the bias
equation given in ^ becomes:



This is identical to the equation ([1]). Therefore, we have $\mathrm{Val}^{\pi}(s) = B^{\pi}(s)$ for
every policy $\pi$.



K Proof of Theorem [18]




\[
B^{\pi}(s) = \max_{a \in A_s} \Big( r(s, a) + \sum_{s' \in S} p(s' \mid s, a) \cdot B^{\pi}(s') \Big)
\]



Policy iteration decides whether an action is switchable by lexicographically
comparing the gain and bias. Since $G^{\pi}(s) = G^{\pi}(s')$ for every pair of states $s$
and $s'$, and every policy $\pi$, we have that $\sum_{s' \in S} p(s' \mid s, a) G^{\pi}(s') = G^{\pi}(s)$ for
every policy $\pi$, every state $s$, and every action $a \in A_s$. Therefore, every decision
on whether an action is switchable is always made using the bias equation.
This implies that policy iteration for the average-reward criterion will behave in
exactly the same way as policy iteration for the total reward criterion. □
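The notions of appeal and switchable action used throughout these proofs can be illustrated on a toy instance. The following is a minimal sketch of greedy policy iteration under the total reward criterion, not the construction from this paper. It assumes a deterministic MDP whose action graph is acyclic and ends in a zero-value sink, so that every policy reaches the sink; the state names, rewards, and function names are all made up for illustration.

```python
def evaluate(policy, rewards, sink):
    """Total-reward value of every state under a deterministic policy.

    Assumes that following the policy from any state reaches the sink.
    """
    values = {sink: 0}

    def value(s):
        if s not in values:
            t = policy[s]
            values[s] = rewards[(s, t)] + value(t)
        return values[s]

    for s in policy:
        value(s)
    return values


def greedy_policy_iteration(actions, rewards, policy, sink):
    """Switch every state with a switchable action to its most appealing one."""
    policy = dict(policy)
    iterations = 0
    while True:
        values = evaluate(policy, rewards, sink)
        switched = False
        for s, successors in actions.items():
            # Appeal(s, t) = r(s, t) + Val(t), measured in the current policy.
            appeal = {t: rewards[(s, t)] + values[t] for t in successors}
            best = max(successors, key=lambda t: appeal[t])
            if appeal[best] > values[s]:  # the action (s, best) is switchable
                policy[s] = best
                switched = True
        if not switched:
            return policy, values, iterations
        iterations += 1


# Hypothetical three-state example with sink "t".
actions = {"u": ["v", "w", "t"], "v": ["w", "t"], "w": ["t"]}
rewards = {("u", "v"): 1, ("u", "w"): 0, ("u", "t"): 2,
           ("v", "w"): 3, ("v", "t"): 1, ("w", "t"): 4}
start = {"u": "t", "v": "t", "w": "t"}
final_policy, final_values, rounds = greedy_policy_iteration(actions, rewards, start, "t")
```

On this instance the state `u` is switched twice, first to `w` and then to `v`, because the value of `v` only rises after `v` itself has been switched. This delayed effect is the kind of behaviour that the lower bound construction amplifies.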



